Background: Childhood obesity is a growing public health concern with significant implications for physical and mental health. Understanding the multifactorial influences on childhood Body Mass Index (BMI) is crucial for developing effective interventions and policies.
Objective: This study aims to predict childhood BMI Z-scores by integrating comprehensive data on postnatal diet, chemical exposures, and serum metabolomics, using advanced statistical models.
Methods: Data from the HELIX study, comprising 1301 mother-child pairs with children aged 6-11 years, were analyzed. Various modeling approaches, including LASSO, ridge, elastic net, decision tree, random forest, and gradient boosting machine (GBM), were employed to predict BMI Z-scores. Models were evaluated based on their Root Mean Squared Error (RMSE) and the significance of the predictors identified.
Results: Models incorporating a comprehensive set of variables (diet, chemicals, metabolomics, and covariates) consistently outperformed those with fewer variables. Specifically, the Group Lasso model achieved the lowest RMSE of 0.875 under 10-fold cross-validation, followed closely by Group Ridge (RMSE: 0.885) and Group Elastic Net (RMSE: 0.885). Ensemble methods such as random forest and GBM also showed robust performance, with GBM yielding an RMSE of 0.954.
Key variables identified across the models include demographic factors such as age and sex, dietary factors like breastfeeding duration and intake of bakery products, and chemical exposures including hs_pcb170_cadj_Log2 and hs_pbde153_cadj_Log2. The addition of metabolomic data significantly enhanced the predictive accuracy, highlighting its potential utility in obesity research.
Conclusion: The findings highlight the critical role of chemical exposures and metabolomic profiles in determining childhood BMI. Incorporating diverse data sources significantly enhances model performance, providing a holistic understanding of the determinants of childhood obesity. Future research should focus on improving data quality, exploring additional covariates, and assessing the generalizability of the models to different populations.
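The model comparison above rests on cross-validated RMSE. As an illustration of that metric only, the sketch below computes a 10-fold cross-validated RMSE in base R on simulated data, with a plain linear model standing in for the penalized and ensemble models actually used in this study:

```r
# Minimal base-R sketch of 10-fold cross-validated RMSE (simulated data;
# lm() is a stand-in for the glmnet / tree-based models reported above).
set.seed(123)
n <- 500
x <- matrix(rnorm(n * 3), ncol = 3)
y <- drop(x %*% c(0.5, -0.3, 0.2)) + rnorm(n)  # noise sd = 1
dat <- data.frame(y = y, x)

# assign each observation to one of 10 folds
folds <- sample(rep(1:10, length.out = n))

# fit on 9 folds, compute RMSE on the held-out fold
fold_rmse <- sapply(1:10, function(k) {
  fit  <- lm(y ~ ., data = dat[folds != k, ])
  pred <- predict(fit, newdata = dat[folds == k, ])
  sqrt(mean((dat$y[folds == k] - pred)^2))
})

mean(fold_rmse)  # cross-validated RMSE, close to the noise sd of 1
```

The same loop structure applies regardless of the fitted model; only the `lm()` call would change.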
Research indicates that postnatal exposure to endocrine-disrupting chemicals (EDCs) such as phthalates, bisphenol A (BPA), and polychlorinated biphenyls (PCBs) can significantly influence body weight and metabolic health (Junge et al., 2018). These chemicals, commonly found in household products and absorbed through dietary intake, interfere with hormonal signaling in children. This hormonal interference can lead to an increased body mass index (BMI), suggesting a potential pathway through which exposure to these chemicals contributes to the development of obesity.
A longitudinal study on Japanese children examined the impact of postnatal exposure (first two years of life) to p,p’-dichlorodiphenyltrichloroethane (p,p’-DDT) and p,p’-dichlorodiphenyldichloroethylene (p,p’-DDE) through breastfeeding (Plouffe et al., 2020). The findings revealed that higher levels of these chemicals in breast milk were associated with increased BMI at 42 months of age. DDT and DDE may interfere with hormonal pathways related to growth and development. These chemicals can mimic or disrupt hormones that regulate metabolism and fat accumulation. This study highlights the importance of understanding how persistent organic pollutants can affect early childhood growth and development.
The study by Harley et al. (2013) investigated the association between prenatal and postnatal bisphenol A (BPA) exposure and various body composition metrics in 9-year-old children from the CHAMACOS cohort. Higher prenatal BPA exposure was linked to lower BMI and body fat percentage in girls but not boys, suggesting sex-specific effects. Conversely, BPA levels measured at age 9 were positively associated with increased adiposity in both sexes, highlighting how the timing of exposure shapes its impact on childhood development.
The 2022 study by Uldbjerg et al. explored the effects of combined exposures to multiple EDCs, suggesting that mixtures of these chemicals can have additive or synergistic effects on BMI and obesity risk. Humans are typically exposed to mixtures of chemicals rather than individual EDCs, making it crucial to understand how these mixtures interact. The research highlighted that interactions between different EDCs can produce additive outcomes (where the effects simply add up) or even synergistic ones (where the combined effect exceeds the sum of the separate effects), significantly amplifying the risk of obesity and metabolic disorders in children. Dose-response analyses indicated that even low-level exposure to multiple EDCs could have significant health impacts through these combined effects.
These studies collectively illustrate the critical role of environmental exposures in shaping metabolic health outcomes in children, highlighting the necessity for ongoing research and policy intervention to mitigate these risks.
How are postnatal environmental exposures, specifically those found in household products and dietary intake, along with specific serum metabolomics profiles, associated with the BMI Z-score of children aged 6-11 years? Specifically, do higher concentrations of certain serum metabolites, indicative of exposure to chemical classes or metals, correlate with variations in BMI Z-score when controlling for age and other relevant covariates? Additionally, can these metabolites serve as biomarkers for the risk of developing obesity in children?
This study utilizes data from the HELIX subcohort of 1301 mother-child pairs, with children aged 6-11 years, for whom complete exposure and outcome data were available. Exposure data included detailed postnatal dietary records and concentrations of various chemicals in child blood samples. The dataset contains both categorical and numerical variables, covering demographic details and biochemical measurements, and supports robust statistical analysis of potential associations between chemical exposure and BMI Z-scores while accounting for confounding factors such as age, sex, and socioeconomic status. There are no missing data among these variables, so no imputation was needed. Child BMI Z-scores were calculated based on WHO growth standards.
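WHO growth references are built on the LMS method, in which a measurement is standardized against age- and sex-specific skewness (L), median (M), and coefficient-of-variation (S) parameters. The sketch below shows the core formula with hypothetical L/M/S values; real values come from the WHO lookup tables, and WHO's additional adjustment for extreme z-scores is omitted here:

```r
# LMS z-score formula used by growth references (hypothetical parameters).
# For L != 0:  z = ((x / M)^L - 1) / (L * S);  for L == 0:  z = log(x / M) / S
lms_zscore <- function(x, L, M, S) {
  if (L != 0) ((x / M)^L - 1) / (L * S) else log(x / M) / S
}

# Example: BMI of 18.2 against illustrative reference values L = -1.2,
# M = 15.6, S = 0.11 (NOT real WHO table entries)
z <- lms_zscore(18.2, L = -1.2, M = 15.6, S = 0.11)
round(z, 2)  # 1.28: this child would sit about 1.3 SD above the median
```

A child whose BMI equals the reference median M gets a z-score of exactly 0, which is a quick sanity check on any LMS implementation.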
The metabolomic data were the exception: records with missing serum metabolomic values were excluded rather than imputed, since there were too many such entries. This was preferred over mean or median imputation, which would have assigned identical values to every child lacking serum metabolomic information.
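A minimal sketch of this complete-case handling, using a small hypothetical metabolome data frame (column names are illustrative only):

```r
# Complete-case filtering: drop any child with a missing serum metabolite
# value instead of imputing it. `metabolome` is hypothetical example data.
metabolome <- data.frame(
  ID   = 1:5,
  met1 = c(0.4, NA, 1.2, 0.9, 0.7),
  met2 = c(2.1, 1.8, NA, 2.5, 2.2)
)

# keep only rows with no NA in any column
complete_metabolome <- metabolome[complete.cases(metabolome), ]
nrow(complete_metabolome)  # 3: children 2 and 3 are excluded
```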
```r
# required packages
library(dplyr)      # filter, bind_rows, %>%
library(knitr)      # kable
library(kableExtra) # kable_styling

# load the HELIX dataset
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/HELIX.RData")

# postnatal chemical and lifestyle variables, excluding allergens
filtered_chem_diet <- codebook %>%
  filter(domain %in% c("Chemicals", "Lifestyles") & period == "Postnatal" & subfamily != "Allergens")

# specific covariates
filtered_covariates <- codebook %>%
  filter(domain == "Covariates" &
           variable_name %in% c("ID", "e3_sex_None", "e3_yearbir_None", "hs_child_age_None"))

# specific phenotype variables
filtered_phenotype <- codebook %>%
  filter(domain == "Phenotype" &
           variable_name %in% c("hs_zbmi_who"))

# combine all necessary variables together
combined_codebook <- bind_rows(filtered_chem_diet, filtered_covariates, filtered_phenotype)

kable(combined_codebook, align = "c", format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
```
| | variable_name | domain | family | subfamily | period | location | period_postnatal | description | var_type | transformation | labels | labelsshort |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| h_bfdur_Ter | h_bfdur_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Breastfeeding duration (weeks) | factor | Tertiles | Breastfeeding | Breastfeeding |
| hs_bakery_prod_Ter | hs_bakery_prod_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: bakery products (hs_cookies + hs_pastries) | factor | Tertiles | Bakery prod | BakeProd |
| hs_beverages_Ter | hs_beverages_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: beverages (hs_dietsoda+hs_soda) | factor | Tertiles | Soda | Soda |
| hs_break_cer_Ter | hs_break_cer_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: breakfast cereal (hs_sugarcer+hs_othcer) | factor | Tertiles | BF cereals | BFcereals |
| hs_caff_drink_Ter | hs_caff_drink_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Drinks a caffeinated or 'energy' drink (eg coca-cola, diet-coke, redbull) | factor | Tertiles | Caffeine | Caffeine |
| hs_dairy_Ter | hs_dairy_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: dairy (hs_cheese + hs_milk + hs_yogurt+ hs_probiotic+ hs_desert) | factor | Tertiles | Dairy | Dairy |
| hs_fastfood_Ter | hs_fastfood_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Visits a fast food restaurant/take away | factor | Tertiles | Fastfood | Fastfood |
| hs_KIDMED_None | hs_KIDMED_None | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Sum of KIDMED indices, without index9 | numeric | None | KIDMED | KIDMED |
| hs_mvpa_prd_alt_None | hs_mvpa_prd_alt_None | Lifestyles | Lifestyle | Physical activity | Postnatal | NA | NA | Clean & Over-reporting of Moderate-to-Vigorous Physical Activity (min/day) | numeric | None | PA | PA |
| hs_org_food_Ter | hs_org_food_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Eats organic food | factor | Tertiles | Organicfood | Organicfood |
| hs_proc_meat_Ter | hs_proc_meat_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: processed meat (hs_coldmeat+hs_ham) | factor | Tertiles | Processed meat | ProcMeat |
| hs_readymade_Ter | hs_readymade_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Eats a 'ready-made' supermarket meal | factor | Tertiles | Ready made food | ReadyFood |
| hs_sd_wk_None | hs_sd_wk_None | Lifestyles | Lifestyle | Physical activity | Postnatal | NA | NA | sedentary behaviour (min/day) | numeric | None | Sedentary | Sedentary |
| hs_total_bread_Ter | hs_total_bread_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: bread (hs_darkbread+hs_whbread) | factor | Tertiles | Bread | Bread |
| hs_total_cereal_Ter | hs_total_cereal_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: cereal (hs_darkbread + hs_whbread + hs_rice_pasta + hs_sugarcer + hs_othcer + hs_rusks) | factor | Tertiles | Cereals | Cereals |
| hs_total_fish_Ter | hs_total_fish_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: fish and seafood (hs_canfish+hs_oilyfish+hs_whfish+hs_seafood) | factor | Tertiles | Fish | Fish |
| hs_total_fruits_Ter | hs_total_fruits_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: fruits (hs_canfruit+hs_dryfruit+hs_freshjuice+hs_fruits) | factor | Tertiles | Fruits | Fruits |
| hs_total_lipids_Ter | hs_total_lipids_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: Added fat | factor | Tertiles | Diet fat | Diet fat |
| hs_total_meat_Ter | hs_total_meat_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: meat (hs_coldmeat+hs_ham+hs_poultry+hs_redmeat) | factor | Tertiles | Meat | Meat |
| hs_total_potatoes_Ter | hs_total_potatoes_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: potatoes (hs_frenchfries+hs_potatoes) | factor | Tertiles | Potatoes | Potatoes |
| hs_total_sweets_Ter | hs_total_sweets_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: sweets (hs_choco + hs_sweets + hs_sugar) | factor | Tertiles | Sweets | Sweets |
| hs_total_veg_Ter | hs_total_veg_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: vegetables (hs_cookveg+hs_rawveg) | factor | Tertiles | Vegetables | Vegetables |
| hs_total_yog_Ter | hs_total_yog_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: yogurt (hs_yogurt+hs_probiotic) | factor | Tertiles | Yogurt | Yogurt |
| hs_dif_hours_total_None | hs_dif_hours_total_None | Lifestyles | Lifestyle | Sleep | Postnatal | NA | NA | Total hours of sleep (mean weekdays and night) | numeric | None | Sleep | Sleep |
| hs_as_c_Log2 | hs_as_c_Log2 | Chemicals | Metals | As | Postnatal | NA | NA | Arsenic (As) in child | numeric | Logarithm base 2 | As | As |
| hs_cd_c_Log2 | hs_cd_c_Log2 | Chemicals | Metals | Cd | Postnatal | NA | NA | Cadmium (Cd) in child | numeric | Logarithm base 2 | Cd | Cd |
| hs_co_c_Log2 | hs_co_c_Log2 | Chemicals | Metals | Co | Postnatal | NA | NA | Cobalt (Co) in child | numeric | Logarithm base 2 | Co | Co |
| hs_cs_c_Log2 | hs_cs_c_Log2 | Chemicals | Metals | Cs | Postnatal | NA | NA | Caesium (Cs) in child | numeric | Logarithm base 2 | Cs | Cs |
| hs_cu_c_Log2 | hs_cu_c_Log2 | Chemicals | Metals | Cu | Postnatal | NA | NA | Copper (Cu) in child | numeric | Logarithm base 2 | Cu | Cu |
| hs_hg_c_Log2 | hs_hg_c_Log2 | Chemicals | Metals | Hg | Postnatal | NA | NA | Mercury (Hg) in child | numeric | Logarithm base 2 | Hg | Hg |
| hs_mn_c_Log2 | hs_mn_c_Log2 | Chemicals | Metals | Mn | Postnatal | NA | NA | Manganese (Mn) in child | numeric | Logarithm base 2 | Mn | Mn |
| hs_mo_c_Log2 | hs_mo_c_Log2 | Chemicals | Metals | Mo | Postnatal | NA | NA | Molybdenum (Mo) in child | numeric | Logarithm base 2 | Mo | Mo |
| hs_pb_c_Log2 | hs_pb_c_Log2 | Chemicals | Metals | Pb | Postnatal | NA | NA | Lead (Pb) in child | numeric | Logarithm base 2 | Pb | Pb |
| hs_tl_cdich_None | hs_tl_cdich_None | Chemicals | Metals | Tl | Postnatal | NA | NA | Dichotomous variable of thallium (Tl) in child | factor | None | Tl | Tl |
| hs_dde_cadj_Log2 | hs_dde_cadj_Log2 | Chemicals | Organochlorines | DDE | Postnatal | NA | NA | Dichlorodiphenyldichloroethylene (DDE) in child adjusted for lipids | numeric | Logarithm base 2 | DDE | DDE |
| hs_ddt_cadj_Log2 | hs_ddt_cadj_Log2 | Chemicals | Organochlorines | DDT | Postnatal | NA | NA | Dichlorodiphenyltrichloroethane (DDT) in child adjusted for lipids | numeric | Logarithm base 2 | DDT | DDT |
| hs_hcb_cadj_Log2 | hs_hcb_cadj_Log2 | Chemicals | Organochlorines | HCB | Postnatal | NA | NA | Hexachlorobenzene (HCB) in child adjusted for lipids | numeric | Logarithm base 2 | HCB | HCB |
| hs_pcb118_cadj_Log2 | hs_pcb118_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Polychlorinated biphenyl -118 (PCB-118) in child adjusted for lipids | numeric | Logarithm base 2 | PCB 118 | PCB118 |
| hs_pcb138_cadj_Log2 | hs_pcb138_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Polychlorinated biphenyl-138 (PCB-138) in child adjusted for lipids | numeric | Logarithm base 2 | PCB 138 | PCB138 |
| hs_pcb153_cadj_Log2 | hs_pcb153_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Polychlorinated biphenyl-153 (PCB-153) in child adjusted for lipids | numeric | Logarithm base 2 | PCB 153 | PCB153 |
| hs_pcb170_cadj_Log2 | hs_pcb170_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Polychlorinated biphenyl-170 (PCB-170) in child adjusted for lipids | numeric | Logarithm base 2 | PCB 170 | PCB170 |
| hs_pcb180_cadj_Log2 | hs_pcb180_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Polychlorinated biphenyl-180 (PCB-180) in child adjusted for lipids | numeric | Logarithm base 2 | PCB 180 | PCB180 |
| hs_sumPCBs5_cadj_Log2 | hs_sumPCBs5_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Sum of PCBs in child adjusted for lipids (4 cohorts) | numeric | Logarithm base 2 | PCBs | SumPCB |
| hs_dep_cadj_Log2 | hs_dep_cadj_Log2 | Chemicals | Organophosphate pesticides | DEP | Postnatal | NA | NA | Diethyl phosphate (DEP) in child adjusted for creatinine | numeric | Logarithm base 2 | DEP | DEP |
| hs_detp_cadj_Log2 | hs_detp_cadj_Log2 | Chemicals | Organophosphate pesticides | DETP | Postnatal | NA | NA | Diethyl thiophosphate (DETP) in child adjusted for creatinine | numeric | Logarithm base 2 | DETP | DETP |
| hs_dmdtp_cdich_None | hs_dmdtp_cdich_None | Chemicals | Organophosphate pesticides | DMDTP | Postnatal | NA | NA | Dichotomous variable of dimethyl dithiophosphate (DMDTP) in child | factor | None | DMDTP | DMDTP |
| hs_dmp_cadj_Log2 | hs_dmp_cadj_Log2 | Chemicals | Organophosphate pesticides | DMP | Postnatal | NA | NA | Dimethyl phosphate (DMP) in child adjusted for creatinine | numeric | Logarithm base 2 | DMP | DMP |
| hs_dmtp_cadj_Log2 | hs_dmtp_cadj_Log2 | Chemicals | Organophosphate pesticides | DMTP | Postnatal | NA | NA | Dimethyl thiophosphate (DMTP) in child adjusted for creatinine | numeric | Logarithm base 2 | DMTP | DMTP |
| hs_pbde153_cadj_Log2 | hs_pbde153_cadj_Log2 | Chemicals | Polybrominated diphenyl ethers (PBDE) | PBDE153 | Postnatal | NA | NA | Polybrominated diphenyl ether-153 (PBDE-153) in child adjusted for lipids | numeric | Logarithm base 2 | PBDE 153 | PBDE153 |
| hs_pbde47_cadj_Log2 | hs_pbde47_cadj_Log2 | Chemicals | Polybrominated diphenyl ethers (PBDE) | PBDE47 | Postnatal | NA | NA | Polybrominated diphenyl ether-47 (PBDE-47) in child adjusted for lipids | numeric | Logarithm base 2 | PBDE 47 | PBDE47 |
| hs_pfhxs_c_Log2 | hs_pfhxs_c_Log2 | Chemicals | Per- and polyfluoroalkyl substances (PFAS) | PFHXS | Postnatal | NA | NA | Perfluorohexane sulfonate (PFHXS) in child | numeric | Logarithm base 2 | PFHXS | PFHXS |
| hs_pfna_c_Log2 | hs_pfna_c_Log2 | Chemicals | Per- and polyfluoroalkyl substances (PFAS) | PFNA | Postnatal | NA | NA | Perfluorononanoate (PFNA) in child | numeric | Logarithm base 2 | PFNA | PFNA |
| hs_pfoa_c_Log2 | hs_pfoa_c_Log2 | Chemicals | Per- and polyfluoroalkyl substances (PFAS) | PFOA | Postnatal | NA | NA | Perfluorooctanoate (PFOA) in child | numeric | Logarithm base 2 | PFOA | PFOA |
| hs_pfos_c_Log2 | hs_pfos_c_Log2 | Chemicals | Per- and polyfluoroalkyl substances (PFAS) | PFOS | Postnatal | NA | NA | Perfluorooctane sulfonate (PFOS) in child | numeric | Logarithm base 2 | PFOS | PFOS |
| hs_pfunda_c_Log2 | hs_pfunda_c_Log2 | Chemicals | Per- and polyfluoroalkyl substances (PFAS) | PFUNDA | Postnatal | NA | NA | Perfluoroundecanoate (PFUNDA) in child | numeric | Logarithm base 2 | PFUNDA | PFUNDA |
| hs_bpa_cadj_Log2 | hs_bpa_cadj_Log2 | Chemicals | Phenols | BPA | Postnatal | NA | NA | Bisphenol A (BPA) in child adjusted for creatinine | numeric | Logarithm base 2 | BPA | BPA |
| hs_bupa_cadj_Log2 | hs_bupa_cadj_Log2 | Chemicals | Phenols | BUPA | Postnatal | NA | NA | N-Butyl paraben (BUPA) in child adjusted for creatinine | numeric | Logarithm base 2 | BUPA | BUPA |
| hs_etpa_cadj_Log2 | hs_etpa_cadj_Log2 | Chemicals | Phenols | ETPA | Postnatal | NA | NA | Ethyl paraben (ETPA) in child adjusted for creatinine | numeric | Logarithm base 2 | ETPA | ETPA |
| hs_mepa_cadj_Log2 | hs_mepa_cadj_Log2 | Chemicals | Phenols | MEPA | Postnatal | NA | NA | Methyl paraben (MEPA) in child adjusted for creatinine | numeric | Logarithm base 2 | MEPA | MEPA |
| hs_oxbe_cadj_Log2 | hs_oxbe_cadj_Log2 | Chemicals | Phenols | OXBE | Postnatal | NA | NA | Oxybenzone (OXBE) in child adjusted for creatinine | numeric | Logarithm base 2 | OXBE | OXBE |
| hs_prpa_cadj_Log2 | hs_prpa_cadj_Log2 | Chemicals | Phenols | PRPA | Postnatal | NA | NA | Propyl paraben (PRPA) in child adjusted for creatinine | numeric | Logarithm base 2 | PRPA | PRPA |
| hs_trcs_cadj_Log2 | hs_trcs_cadj_Log2 | Chemicals | Phenols | TRCS | Postnatal | NA | NA | Triclosan (TRCS) in child adjusted for creatinine | numeric | Logarithm base 2 | TRCS | TRCS |
| hs_mbzp_cadj_Log2 | hs_mbzp_cadj_Log2 | Chemicals | Phthalates | MBZP | Postnatal | NA | NA | Mono benzyl phthalate (MBzP) in child adjusted for creatinine | numeric | Logarithm base 2 | MBZP | MBZP |
| hs_mecpp_cadj_Log2 | hs_mecpp_cadj_Log2 | Chemicals | Phthalates | MECPP | Postnatal | NA | NA | Mono-2-ethyl 5-carboxypentyl phthalate (MECPP) in child adjusted for creatinine | numeric | Logarithm base 2 | MECPP | MECPP |
| hs_mehhp_cadj_Log2 | hs_mehhp_cadj_Log2 | Chemicals | Phthalates | MEHHP | Postnatal | NA | NA | Mono-2-ethyl-5-hydroxyhexyl phthalate (MEHHP) in child adjusted for creatinine | numeric | Logarithm base 2 | MEHHP | MEHHP |
| hs_mehp_cadj_Log2 | hs_mehp_cadj_Log2 | Chemicals | Phthalates | MEHP | Postnatal | NA | NA | Mono-2-ethylhexyl phthalate (MEHP) in child adjusted for creatinine | numeric | Logarithm base 2 | MEHP | MEHP |
| hs_meohp_cadj_Log2 | hs_meohp_cadj_Log2 | Chemicals | Phthalates | MEOHP | Postnatal | NA | NA | Mono-2-ethyl-5-oxohexyl phthalate (MEOHP) in child adjusted for creatinine | numeric | Logarithm base 2 | MEOHP | MEOHP |
| hs_mep_cadj_Log2 | hs_mep_cadj_Log2 | Chemicals | Phthalates | MEP | Postnatal | NA | NA | Monoethyl phthalate (MEP) in child adjusted for creatinine | numeric | Logarithm base 2 | MEP | MEP |
| hs_mibp_cadj_Log2 | hs_mibp_cadj_Log2 | Chemicals | Phthalates | MIBP | Postnatal | NA | NA | Mono-iso-butyl phthalate (MiBP) in child adjusted for creatinine | numeric | Logarithm base 2 | MIBP | MIBP |
| hs_mnbp_cadj_Log2 | hs_mnbp_cadj_Log2 | Chemicals | Phthalates | MNBP | Postnatal | NA | NA | Mono-n-butyl phthalate (MnBP) in child adjusted for creatinine | numeric | Logarithm base 2 | MNBP | MNBP |
| hs_ohminp_cadj_Log2 | hs_ohminp_cadj_Log2 | Chemicals | Phthalates | OHMiNP | Postnatal | NA | NA | Mono-4-methyl-7-hydroxyoctyl phthalate (OHMiNP) in child adjusted for creatinine | numeric | Logarithm base 2 | OHMiNP | OHMiNP |
| hs_oxominp_cadj_Log2 | hs_oxominp_cadj_Log2 | Chemicals | Phthalates | OXOMINP | Postnatal | NA | NA | Mono-4-methyl-7-oxooctyl phthalate (OXOMiNP) in child adjusted for creatinine | numeric | Logarithm base 2 | OXOMINP | OXOMINP |
| hs_sumDEHP_cadj_Log2 | hs_sumDEHP_cadj_Log2 | Chemicals | Phthalates | DEHP | Postnatal | NA | NA | Sum of DEHP metabolites (µg/g) in child adjusted for creatinine | numeric | Logarithm base 2 | DEHP | SumDEHP |
| FAS_cat_None | FAS_cat_None | Chemicals | Social and economic capital | Economic capital | Postnatal | NA | NA | Family affluence score | factor | None | Family affluence | FamAfl |
| hs_contactfam_3cat_num_None | hs_contactfam_3cat_num_None | Chemicals | Social and economic capital | Social capital | Postnatal | NA | NA | social capital: family friends | factor | None | Social contact | SocCont |
| hs_hm_pers_None | hs_hm_pers_None | Chemicals | Social and economic capital | Social capital | Postnatal | NA | NA | How many people live in your home? | numeric | None | House crowding | HouseCrow |
| hs_participation_3cat_None | hs_participation_3cat_None | Chemicals | Social and economic capital | Social capital | Postnatal | NA | NA | social capital: structural | factor | None | Social participation | SocPartic |
| hs_cotinine_cdich_None | hs_cotinine_cdich_None | Chemicals | Tobacco Smoke | Cotinine | Postnatal | NA | NA | Dichotomous variable of cotinine in child | factor | None | Cotinine | Cotinine |
| hs_globalexp2_None | hs_globalexp2_None | Chemicals | Tobacco Smoke | Tobacco Smoke | Postnatal | NA | NA | Global exposure of the child to ETS (2 categories) | factor | None | ETS | ETS |
| hs_smk_parents_None | hs_smk_parents_None | Chemicals | Tobacco Smoke | Tobacco Smoke | Postnatal | NA | NA | Tobacco Smoke status of parents (both) | factor | None | Smoking_parents | SmokPar |
| e3_sex_None | e3_sex_None | Covariates | Covariates | Child covariate | Pregnancy | NA | NA | Child sex (female / male) | factor | None | Child sex | Sex |
| e3_yearbir_None | e3_yearbir_None | Covariates | Covariates | Child covariate | Pregnancy | NA | NA | Year of birth (2003 to 2009) | factor | None | Year of birth | YearBirth |
| hs_child_age_None | hs_child_age_None | Covariates | Covariates | Child covariate | Postnatal | NA | NA | Child age at examination (years) | numeric | None | Child age | cAge |
| hs_zbmi_who | hs_zbmi_who | Phenotype | Phenotype | Outcome at 6-11 years old | Postnatal | NA | NA | Body mass index z-score at 6-11 years old - WHO reference - Standardized on sex and age | numeric | None | Body mass index z-score | zBMI |
These variables were categorized into tertiles to assess the impact of different levels of dietary intake on BMI Z-scores. The dietary intake variables included:
h_bfdur_Ter: Duration of breastfeeding
hs_bakery_prod_Ter: Intake of bakery products
hs_break_cer_Ter: Intake of breakfast cereals
hs_dairy_Ter: Intake of dairy products
hs_fastfood_Ter: Intake of fast food
hs_org_food_Ter: Intake of organic food
hs_proc_meat_Ter: Intake of processed meat
hs_total_fish_Ter: Intake of total fish
hs_total_fruits_Ter: Intake of total fruits
hs_total_lipids_Ter: Intake of total lipids
hs_total_sweets_Ter: Intake of total sweets
hs_total_veg_Ter: Intake of total vegetables
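As an illustration of how such `_Ter` factors could be derived, the sketch below splits a simulated continuous intake variable at its empirical tertiles; the exact HELIX derivation is not documented here, so this is an assumed general approach with illustrative labels:

```r
# Tertile categorization sketch (simulated data; the real HELIX cutpoints
# and labels may differ).
set.seed(1)
weekly_servings <- rgamma(300, shape = 2, rate = 0.5)  # skewed intake variable

# cut at the empirical 1/3 and 2/3 quantiles to form three equal-sized groups
tertile_breaks <- quantile(weekly_servings, probs = c(0, 1/3, 2/3, 1))
tertile <- cut(weekly_servings,
               breaks = tertile_breaks,
               labels = c("Low", "Medium", "High"),
               include.lowest = TRUE)

table(tertile)  # roughly 100 observations per tertile
```

`include.lowest = TRUE` matters: without it, the minimum value would fall outside the first interval and become `NA`.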
```r
# specific lifestyle exposures
dietary_exposures <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_break_cer_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter"
)

# subset the exposome data to the dietary variables
dietary_exposome <- dplyr::select(exposome, all_of(dietary_exposures))

# summary table of the dietary variables
summarytools::view(
  dfSummary(dietary_exposome, style = 'grid', plain.ascii = FALSE,
            valid.col = FALSE, headings = FALSE),
  method = "render"
)
```
| No | Variable | Missing |
|---|---|---|
| 1 | h_bfdur_Ter [factor] | 0 (0.0%) |
| 2 | hs_bakery_prod_Ter [factor] | 0 (0.0%) |
| 3 | hs_break_cer_Ter [factor] | 0 (0.0%) |
| 4 | hs_dairy_Ter [factor] | 0 (0.0%) |
| 5 | hs_fastfood_Ter [factor] | 0 (0.0%) |
| 6 | hs_org_food_Ter [factor] | 0 (0.0%) |
| 7 | hs_proc_meat_Ter [factor] | 0 (0.0%) |
| 8 | hs_total_fish_Ter [factor] | 0 (0.0%) |
| 9 | hs_total_fruits_Ter [factor] | 0 (0.0%) |
| 10 | hs_total_lipids_Ter [factor] | 0 (0.0%) |
| 11 | hs_total_sweets_Ter [factor] | 0 (0.0%) |
| 12 | hs_total_veg_Ter [factor] | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.4.0)
2024-07-30
```r
# keep only the factor (categorical) dietary variables
categorical_diet <- dietary_exposome %>%
  dplyr::select(where(is.factor))

# reshape to long format: one row per (variable, value) pair
categorical_diet_long <- pivot_longer(
  categorical_diet,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

# one bar chart per categorical variable
unique_categorical_vars <- unique(categorical_diet_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
  data <- filter(categorical_diet_long, variable == var)
  p <- ggplot(data, aes(x = value, fill = value)) +
    geom_bar(stat = "count") +
    labs(title = paste("Distribution of", var), x = var, y = "Count")
  print(p)
  p
})
```
Breastfeeding Duration: The majority of observations fall in the highest duration category, suggesting that longer breastfeeding periods are common.
Bakery Products: Shows a relatively even distribution across the three categories, indicating varied consumption levels of bakery products among participants.
Breakfast Cereal: The highest category of cereal consumption is the most common, suggesting a preference for or greater consumption of cereals.
Dairy: Shows a fairly even distribution across all categories, indicating a uniform consumption pattern of dairy products.
Fast Food: Most participants fall into the middle category, indicating moderate consumption of fast food.
Organic Food: Most participants either consume a lot of or no organic food, with fewer in the middle range.
Processed Meat: Consumption levels are fairly evenly distributed, indicating varied dietary habits regarding processed meats.
Bread: The distribution leans markedly towards higher bread consumption.
Cereal: Even distribution across categories suggests varied cereal consumption habits.
Fish and Seafood: Even distribution across categories, indicating varied consumption of fish and seafood.
Fruits: High fruit consumption is the most common, with fewer participants in the lowest category.
Added Fats: More participants consume added fats at the lowest and highest levels, with fewer in the middle.
Sweets: High consumption of sweets is the most common, indicating a preference for or higher access to sugary foods.
Vegetables: Most participants consume a high amount of vegetables.
This study included a comprehensive assessment of various chemical exposures to understand their impact on childhood BMI Z-scores. The specific chemical exposures selected for analysis were as follows:
hs_cd_c_Log2: Cadmium concentration
hs_co_c_Log2: Cobalt concentration
hs_cs_c_Log2: Cesium concentration
hs_cu_c_Log2: Copper concentration
hs_hg_c_Log2: Mercury concentration
hs_mo_c_Log2: Molybdenum concentration
hs_pb_c_Log2: Lead concentration
hs_dde_cadj_Log2: DDE (a breakdown product of DDT) concentration, adjusted
hs_pcb153_cadj_Log2: PCB 153 concentration, adjusted
hs_pcb170_cadj_Log2: PCB 170 concentration, adjusted
hs_dep_cadj_Log2: DEP (diethyl phosphate) concentration, adjusted
hs_pbde153_cadj_Log2: PBDE 153 (polybrominated diphenyl ether) concentration, adjusted
hs_pfhxs_c_Log2: PFHxS (perfluorohexane sulfonic acid) concentration
hs_pfoa_c_Log2: PFOA (perfluorooctanoic acid) concentration
hs_pfos_c_Log2: PFOS (perfluorooctane sulfonic acid) concentration
hs_prpa_cadj_Log2: PRPA (propyl paraben) concentration, adjusted
hs_mbzp_cadj_Log2: MBzP (mono-benzyl phthalate) concentration, adjusted
hs_mibp_cadj_Log2: MiBP (mono-isobutyl phthalate) concentration, adjusted
hs_mnbp_cadj_Log2: MnBP (mono-n-butyl phthalate) concentration, adjusted
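The `_Log2` suffix on these variables reflects a base-2 log transform, which tames the right-skew typical of biomarker concentrations and makes a doubling of exposure correspond to a one-unit increase. A small base-R illustration with simulated concentrations:

```r
# Why log2? Simulated right-skewed concentrations (lognormal), as is
# typical for environmental biomarkers.
set.seed(42)
conc <- rlnorm(1000, meanlog = 1, sdlog = 0.8)

conc_log2 <- log2(conc)

mean(conc) > median(conc)    # TRUE: mean pulled right by the long tail
log2(2 * 5.3) - log2(5.3)    # exactly 1: a doubling = +1 on the log2 scale
```

On the log2 scale a regression coefficient is therefore interpretable as the change in outcome per doubling of exposure.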
```r
# specific chemical exposures
chemical_exposures <- c(
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_prpa_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2"
)

# subset the exposome data to the chemical variables
chemical_exposome <- dplyr::select(exposome, all_of(chemical_exposures))

# summary table of the chemical variables
summarytools::view(
  dfSummary(chemical_exposome, style = 'grid', plain.ascii = FALSE,
            valid.col = FALSE, headings = FALSE),
  method = "render"
)
```
| No | Variable | Distinct values | Missing |
|---|---|---|---|
| 1 | hs_cd_c_Log2 [numeric] | 695 | 0 (0.0%) |
| 2 | hs_co_c_Log2 [numeric] | 317 | 0 (0.0%) |
| 3 | hs_cs_c_Log2 [numeric] | 369 | 0 (0.0%) |
| 4 | hs_cu_c_Log2 [numeric] | 345 | 0 (0.0%) |
| 5 | hs_hg_c_Log2 [numeric] | 698 | 0 (0.0%) |
| 6 | hs_mo_c_Log2 [numeric] | 593 | 0 (0.0%) |
| 7 | hs_pb_c_Log2 [numeric] | 529 | 0 (0.0%) |
| 8 | hs_dde_cadj_Log2 [numeric] | 1050 | 0 (0.0%) |
| 9 | hs_pcb153_cadj_Log2 [numeric] | 1047 | 0 (0.0%) |
| 10 | hs_pcb170_cadj_Log2 [numeric] | 1039 | 0 (0.0%) |
| 11 | hs_dep_cadj_Log2 [numeric] | 1045 | 0 (0.0%) |
| 12 | hs_pbde153_cadj_Log2 [numeric] | 1036 | 0 (0.0%) |
| 13 | hs_pfhxs_c_Log2 [numeric] | 1061 | 0 (0.0%) |
| 14 | hs_pfoa_c_Log2 [numeric] | 1061 | 0 (0.0%) |
| 15 | hs_pfos_c_Log2 [numeric] | 1050 | 0 (0.0%) |
| 16 | hs_prpa_cadj_Log2 [numeric] | 1031 | 0 (0.0%) |
| 17 | hs_mbzp_cadj_Log2 [numeric] | 1046 | 0 (0.0%) |
| 18 | hs_mibp_cadj_Log2 [numeric] | 1057 | 0 (0.0%) |
| 19 | hs_mnbp_cadj_Log2 [numeric] | 1048 | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.4.0)
2024-07-30
# select numeric chemical variables and reshape to long format for plotting
numeric_chemical <- chemical_exposome %>%
dplyr::select(where(is.numeric))
numeric_chemical_long <- pivot_longer(
numeric_chemical,
cols = everything(),
names_to = "variable",
values_to = "value"
)
unique_numerical_vars <- unique(numeric_chemical_long$variable)
num_plots <- lapply(unique_numerical_vars, function(var) {
data <- filter(numeric_chemical_long, variable == var)
p <- ggplot(data, aes(x = value)) +
geom_histogram(bins = 30, fill = "blue") +
labs(title = paste("Histogram of", var), x = "Value", y = "Count")
print(p)
return(p)
})
Cadmium (hs_cd_c_Log2): The histogram for cadmium exposure shows a relatively symmetric distribution centered around -4 on the log2 scale, with most values between approximately -5 and -3 and few outliers.
Cobalt (hs_co_c_Log2): The histogram of cobalt levels displays a roughly normal distribution with a slight positive skew, peaking around -2.3.
Cesium (hs_cs_c_Log2): Exhibits a right-skewed distribution centered around 0.4, indicating that most participants have relatively low exposure levels while a small number have substantially higher exposures.
Copper (hs_cu_c_Log2): Shows a right-skewed distribution, suggesting that while most individuals have moderate exposure, a few experience substantially higher copper levels.
Mercury (hs_hg_c_Log2): This distribution is also right-skewed, a pattern common for environmental pollutants, where the majority have lower exposure levels and a minority have high exposure levels.
Molybdenum (hs_mo_c_Log2): Shows a sharp peak around -0.3 with a long right tail; most participants have similar exposure levels within a narrow range, while a few have exceptionally high exposures.
Lead (hs_pb_c_Log2): The distribution is slightly right-skewed, indicating higher exposure levels in a smaller subset of the population.
DDE (hs_dde_cadj_Log2): Shows a pronounced right skew, typical for chemicals that accumulate in the environment and in human tissues, with higher exposure concentrated in a smaller subset of the population.
PCB 153 (hs_pcb153_cadj_Log2): Right-skewed and bimodal, with two peaks around 2 and 4, suggesting that higher exposure is concentrated in a smaller segment of the population.
PCB 170 (hs_pcb170_cadj_Log2): This histogram shows a marked right skew: most samples have low concentrations, while a few show considerably higher levels.
DEP (hs_dep_cadj_Log2): DEP exposure shows a sharp central peak near 0, despite a wide overall range.
PBDE 153 (hs_pbde153_cadj_Log2): This histogram shows a bimodal distribution, with a larger mode at low concentrations and a smaller mode at higher concentrations.
PFHxS (hs_pfhxs_c_Log2): The distribution peaks around -1.5 with a broad spread, indicating substantial variation in PFHxS levels.
PFOA (hs_pfoa_c_Log2): Roughly symmetric with a peak around 0.6, indicating fairly consistent PFOA levels.
PFOS (hs_pfos_c_Log2): Similar to PFOA, with a peak around 1, indicating fairly consistent PFOS levels.
Propyl Paraben (hs_prpa_cadj_Log2): The distribution is broad, centered around -1.6, indicating high variability in propyl paraben levels.
Monobenzyl Phthalate (hs_mbzp_cadj_Log2): This histogram is right-skewed, with most values clustering at the lower end around 2.4 and a long tail toward higher values, suggesting occasional higher exposures.
Monoisobutyl Phthalate (hs_mibp_cadj_Log2): Right-skewed like MBzP but with a smoother decline; most subjects have lower exposure levels while a few experience substantially higher exposures.
Mono-n-butyl Phthalate (hs_mnbp_cadj_Log2): Peaks around 4.7, indicating fairly consistent exposure levels with some variation and few outliers.
Covariates were selected based on their relevance to the child's postnatal development. The exceptions are sex and year of birth, which are determined at pregnancy; these were included because BMI differs between sexes and varies with the child's age and year of birth.
# Specified covariates
specific_covariates <- c(
"e3_sex_None",
"e3_yearbir_None",
"hs_child_age_None"
)
covariate_data <- dplyr::select(covariates, all_of(specific_covariates))
summarytools::view(dfSummary(covariate_data, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
| No | Variable | Type | Distinct values | Missing |
|---|---|---|---|---|
| 1 | e3_sex_None | factor | 2 levels | 0 (0.0%) |
| 2 | e3_yearbir_None | factor | 7 levels | 0 (0.0%) |
| 3 | hs_child_age_None | numeric | 879 | 0 (0.0%) |
#separate numeric and categorical data
numeric_covariates <- covariate_data %>%
dplyr::select(where(is.numeric))
numeric_covariates_long <- pivot_longer(
numeric_covariates,
cols = everything(),
names_to = "variable",
values_to = "value"
)
unique_numerical_vars <- unique(numeric_covariates_long$variable)
num_plots <- lapply(unique_numerical_vars, function(var) {
data <- filter(numeric_covariates_long, variable == var)
p <- ggplot(data, aes(x = value)) +
geom_histogram(bins = 30, fill = "blue") +
labs(title = paste("Histogram of", var), x = "Value", y = "Count")
print(p)
return(p)
})
Child’s Age (hs_child_age_None): This histogram is multimodal, with several peaks across different ages, which may reflect the data collection points or the particular age groups studied.
categorical_covariates <- covariate_data %>%
dplyr::select(where(is.factor))
categorical_covariates_long <- pivot_longer(
categorical_covariates,
cols = everything(),
names_to = "variable",
values_to = "value"
)
unique_categorical_vars <- unique(categorical_covariates_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
data <- filter(categorical_covariates_long, variable == var)
p <- ggplot(data, aes(x = value, fill = value)) +
geom_bar(stat = "count") +
labs(title = paste("Distribution of", var), x = var, y = "Count")
print(p)
return(p)
})
Sex Distribution (e3_sex_None): The sex distribution is nearly balanced, with a slightly higher count for females (693) than males (608).
Year of Birth (e3_yearbir_None): Most subjects were born between 2005 and 2008, with the largest group in 2008 (N=379), perhaps indicating larger recruitment or a specific cohort focus in those years; relatively few were born in 2003 (N=55) and only 13 in 2009.
The primary outcome of interest in this analysis is the Body Mass Index (BMI) Z-score of children, which is a measure of relative weight adjusted for child age and sex. The BMI Z-score provides a standardized way to compare a child’s BMI with a reference population.
outcome_BMI <- phenotype %>%
dplyr::select(hs_zbmi_who)
summarytools::view(dfSummary(outcome_BMI, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
| No | Variable | Type | Distinct values | Missing |
|---|---|---|---|---|
| 1 | hs_zbmi_who | numeric | 421 | 0 (0.0%) |
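The z-score transform behind hs_zbmi_who can be illustrated with the LMS method used by WHO growth references. The L, M, and S values below are made-up placeholders for illustration, not actual WHO reference values:

```r
# Sketch of the LMS transform: L (skewness), M (median), and S (coefficient
# of variation) come from age- and sex-specific reference tables; the values
# used below are hypothetical placeholders, NOT real WHO tables.
lms_z <- function(x, L, M, S) {
  if (abs(L) < 1e-12) {
    log(x / M) / S              # limiting case when L == 0
  } else {
    ((x / M)^L - 1) / (L * S)
  }
}

# A BMI equal to the reference median maps to z = 0
lms_z(15.6, L = -1.6, M = 15.6, S = 0.08)
```

Positive z-scores indicate a BMI above the reference median for that age and sex, negative z-scores below it.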
combined_data <- cbind(covariate_data, dietary_exposome, chemical_exposome, outcome_BMI)
combined_data <- combined_data[, !duplicated(colnames(combined_data))]
# sex variable to a factor for stratification
combined_data$e3_sex_None <- as.factor(combined_data$e3_sex_None)
levels(combined_data$e3_sex_None) <- c("Male", "Female")
render_cont <- function(x) {
with(stats.default(x), sprintf("%0.2f (%0.2f)", MEAN, SD))
}
render_cat <- function(x) {
c("", sapply(stats.default(x), function(y) with(y, sprintf("%d (%0.1f %%)", FREQ, PCT))))
}
table1_formula <- ~
hs_child_age_None + e3_yearbir_None +
hs_zbmi_who +
h_bfdur_Ter + hs_bakery_prod_Ter + hs_break_cer_Ter + hs_dairy_Ter + hs_fastfood_Ter + hs_org_food_Ter +
hs_proc_meat_Ter +
hs_total_fish_Ter + hs_total_fruits_Ter + hs_total_lipids_Ter + hs_total_sweets_Ter + hs_total_veg_Ter +
hs_cd_c_Log2 + hs_co_c_Log2 + hs_cs_c_Log2 + hs_cu_c_Log2 +
hs_hg_c_Log2 + hs_mo_c_Log2 + hs_dde_cadj_Log2 + hs_pcb153_cadj_Log2 +
hs_pcb170_cadj_Log2 + hs_dep_cadj_Log2 + hs_pbde153_cadj_Log2 +
hs_pfhxs_c_Log2 + hs_pfoa_c_Log2 + hs_pfos_c_Log2 + hs_prpa_cadj_Log2 +
hs_mbzp_cadj_Log2 + hs_mibp_cadj_Log2 + hs_mnbp_cadj_Log2 | e3_sex_None
table1(
table1_formula,
data = combined_data,
render.continuous = render_cont,
render.categorical = render_cat,
  overall = "Overall",
topclass = "Rtable1-shade"
)
| | Male (N=608) | Female (N=693) | Overall (N=1301) |
|---|---|---|---|
| hs_child_age_None | 7.91 (1.58) | 8.03 (1.64) | 7.98 (1.61) |
| e3_yearbir_None | |||
| 2003 | 25 (4.1 %) | 30 (4.3 %) | 55 (4.2 %) |
| 2004 | 46 (7.6 %) | 61 (8.8 %) | 107 (8.2 %) |
| 2005 | 121 (19.9 %) | 120 (17.3 %) | 241 (18.5 %) |
| 2006 | 108 (17.8 %) | 148 (21.4 %) | 256 (19.7 %) |
| 2007 | 128 (21.1 %) | 122 (17.6 %) | 250 (19.2 %) |
| 2008 | 177 (29.1 %) | 202 (29.1 %) | 379 (29.1 %) |
| 2009 | 3 (0.5 %) | 10 (1.4 %) | 13 (1.0 %) |
| hs_zbmi_who | 0.35 (1.15) | 0.45 (1.22) | 0.40 (1.19) |
| h_bfdur_Ter | |||
| (0,10.8] | 231 (38.0 %) | 275 (39.7 %) | 506 (38.9 %) |
| (10.8,34.9] | 118 (19.4 %) | 152 (21.9 %) | 270 (20.8 %) |
| (34.9,Inf] | 259 (42.6 %) | 266 (38.4 %) | 525 (40.4 %) |
| hs_bakery_prod_Ter | |||
| (0,2] | 164 (27.0 %) | 181 (26.1 %) | 345 (26.5 %) |
| (2,6] | 188 (30.9 %) | 235 (33.9 %) | 423 (32.5 %) |
| (6,Inf] | 256 (42.1 %) | 277 (40.0 %) | 533 (41.0 %) |
| hs_break_cer_Ter | |||
| (0,1.1] | 141 (23.2 %) | 150 (21.6 %) | 291 (22.4 %) |
| (1.1,5.5] | 251 (41.3 %) | 270 (39.0 %) | 521 (40.0 %) |
| (5.5,Inf] | 216 (35.5 %) | 273 (39.4 %) | 489 (37.6 %) |
| hs_dairy_Ter | |||
| (0,14.6] | 175 (28.8 %) | 184 (26.6 %) | 359 (27.6 %) |
| (14.6,25.6] | 229 (37.7 %) | 236 (34.1 %) | 465 (35.7 %) |
| (25.6,Inf] | 204 (33.6 %) | 273 (39.4 %) | 477 (36.7 %) |
| hs_fastfood_Ter | |||
| (0,0.132] | 75 (12.3 %) | 68 (9.8 %) | 143 (11.0 %) |
| (0.132,0.5] | 273 (44.9 %) | 330 (47.6 %) | 603 (46.3 %) |
| (0.5,Inf] | 260 (42.8 %) | 295 (42.6 %) | 555 (42.7 %) |
| hs_org_food_Ter | |||
| (0,0.132] | 211 (34.7 %) | 218 (31.5 %) | 429 (33.0 %) |
| (0.132,1] | 191 (31.4 %) | 205 (29.6 %) | 396 (30.4 %) |
| (1,Inf] | 206 (33.9 %) | 270 (39.0 %) | 476 (36.6 %) |
| hs_proc_meat_Ter | |||
| (0,1.5] | 175 (28.8 %) | 191 (27.6 %) | 366 (28.1 %) |
| (1.5,4] | 227 (37.3 %) | 244 (35.2 %) | 471 (36.2 %) |
| (4,Inf] | 206 (33.9 %) | 258 (37.2 %) | 464 (35.7 %) |
| hs_total_fish_Ter | |||
| (0,1.5] | 183 (30.1 %) | 206 (29.7 %) | 389 (29.9 %) |
| (1.5,3] | 224 (36.8 %) | 230 (33.2 %) | 454 (34.9 %) |
| (3,Inf] | 201 (33.1 %) | 257 (37.1 %) | 458 (35.2 %) |
| hs_total_fruits_Ter | |||
| (0,7] | 174 (28.6 %) | 239 (34.5 %) | 413 (31.7 %) |
| (7,14.1] | 216 (35.5 %) | 191 (27.6 %) | 407 (31.3 %) |
| (14.1,Inf] | 218 (35.9 %) | 263 (38.0 %) | 481 (37.0 %) |
| hs_total_lipids_Ter | |||
| (0,3] | 193 (31.7 %) | 204 (29.4 %) | 397 (30.5 %) |
| (3,7] | 171 (28.1 %) | 232 (33.5 %) | 403 (31.0 %) |
| (7,Inf] | 244 (40.1 %) | 257 (37.1 %) | 501 (38.5 %) |
| hs_total_sweets_Ter | |||
| (0,4.1] | 149 (24.5 %) | 195 (28.1 %) | 344 (26.4 %) |
| (4.1,8.5] | 251 (41.3 %) | 265 (38.2 %) | 516 (39.7 %) |
| (8.5,Inf] | 208 (34.2 %) | 233 (33.6 %) | 441 (33.9 %) |
| hs_total_veg_Ter | |||
| (0,6] | 190 (31.2 %) | 214 (30.9 %) | 404 (31.1 %) |
| (6,8.5] | 136 (22.4 %) | 178 (25.7 %) | 314 (24.1 %) |
| (8.5,Inf] | 282 (46.4 %) | 301 (43.4 %) | 583 (44.8 %) |
| hs_cd_c_Log2 | -3.99 (0.98) | -3.95 (1.09) | -3.97 (1.04) |
| hs_co_c_Log2 | -2.37 (0.61) | -2.32 (0.64) | -2.34 (0.63) |
| hs_cs_c_Log2 | 0.44 (0.58) | 0.44 (0.57) | 0.44 (0.57) |
| hs_cu_c_Log2 | 9.81 (0.25) | 9.84 (0.22) | 9.83 (0.23) |
| hs_hg_c_Log2 | -0.24 (1.59) | -0.35 (1.75) | -0.30 (1.68) |
| hs_mo_c_Log2 | -0.32 (0.83) | -0.31 (0.96) | -0.32 (0.90) |
| hs_dde_cadj_Log2 | 4.63 (1.48) | 4.70 (1.50) | 4.67 (1.49) |
| hs_pcb153_cadj_Log2 | 3.47 (0.86) | 3.63 (0.94) | 3.56 (0.90) |
| hs_pcb170_cadj_Log2 | -0.60 (3.22) | -0.05 (2.77) | -0.31 (3.00) |
| hs_dep_cadj_Log2 | 0.27 (3.16) | 0.06 (3.25) | 0.16 (3.21) |
| hs_pbde153_cadj_Log2 | -4.66 (3.86) | -4.40 (3.80) | -4.53 (3.83) |
| hs_pfhxs_c_Log2 | -1.62 (1.30) | -1.53 (1.31) | -1.57 (1.31) |
| hs_pfoa_c_Log2 | 0.60 (0.55) | 0.62 (0.56) | 0.61 (0.55) |
| hs_pfos_c_Log2 | 0.95 (1.15) | 0.99 (1.08) | 0.97 (1.11) |
| hs_prpa_cadj_Log2 | -1.26 (3.96) | -1.91 (3.68) | -1.61 (3.82) |
| hs_mbzp_cadj_Log2 | 2.42 (1.23) | 2.47 (1.22) | 2.44 (1.22) |
| hs_mibp_cadj_Log2 | 5.54 (1.09) | 5.39 (1.12) | 5.46 (1.11) |
| hs_mnbp_cadj_Log2 | 4.77 (1.08) | 4.60 (0.96) | 4.68 (1.02) |
combined_data$e3_yearbir_None <- as.factor(combined_data$e3_yearbir_None)
render_cont <- function(x) {
with(stats.default(x), sprintf("%0.2f (%0.2f)", MEAN, SD))
}
render_cat <- function(x) {
c("", sapply(stats.default(x), function(y) with(y, sprintf("%d (%0.1f %%)", FREQ, PCT))))
}
table1_formula_year <- ~
hs_child_age_None + e3_sex_None +
hs_zbmi_who +
h_bfdur_Ter + hs_bakery_prod_Ter + hs_break_cer_Ter + hs_dairy_Ter + hs_fastfood_Ter + hs_org_food_Ter +
hs_proc_meat_Ter +
hs_total_fish_Ter + hs_total_fruits_Ter + hs_total_lipids_Ter + hs_total_sweets_Ter + hs_total_veg_Ter +
hs_cd_c_Log2 + hs_co_c_Log2 + hs_cs_c_Log2 + hs_cu_c_Log2 +
hs_hg_c_Log2 + hs_mo_c_Log2 + hs_dde_cadj_Log2 + hs_pcb153_cadj_Log2 +
hs_pcb170_cadj_Log2 + hs_dep_cadj_Log2 + hs_pbde153_cadj_Log2 +
hs_pfhxs_c_Log2 + hs_pfoa_c_Log2 + hs_pfos_c_Log2 + hs_prpa_cadj_Log2 +
hs_mbzp_cadj_Log2 + hs_mibp_cadj_Log2 + hs_mnbp_cadj_Log2 | e3_yearbir_None
table1(
table1_formula_year,
data = combined_data,
render.continuous = render_cont,
render.categorical = render_cat,
  overall = "Overall",
topclass = "Rtable1-shade"
)
| | 2003 (N=55) | 2004 (N=107) | 2005 (N=241) | 2006 (N=256) | 2007 (N=250) | 2008 (N=379) | 2009 (N=13) | Overall (N=1301) |
|---|---|---|---|---|---|---|---|---|
| hs_child_age_None | 11.32 (0.47) | 10.81 (0.38) | 9.19 (0.57) | 8.38 (0.41) | 6.91 (0.51) | 6.41 (0.31) | 6.18 (0.15) | 7.98 (1.61) |
| e3_sex_None | ||||||||
| Male | 25 (45.5 %) | 46 (43.0 %) | 121 (50.2 %) | 108 (42.2 %) | 128 (51.2 %) | 177 (46.7 %) | 3 (23.1 %) | 608 (46.7 %) |
| Female | 30 (54.5 %) | 61 (57.0 %) | 120 (49.8 %) | 148 (57.8 %) | 122 (48.8 %) | 202 (53.3 %) | 10 (76.9 %) | 693 (53.3 %) |
| hs_zbmi_who | 0.32 (1.15) | 0.12 (1.17) | 0.45 (1.11) | 0.34 (1.11) | 0.41 (1.22) | 0.49 (1.28) | 0.79 (1.14) | 0.40 (1.19) |
| h_bfdur_Ter | ||||||||
| (0,10.8] | 35 (63.6 %) | 65 (60.7 %) | 92 (38.2 %) | 81 (31.6 %) | 90 (36.0 %) | 138 (36.4 %) | 5 (38.5 %) | 506 (38.9 %) |
| (10.8,34.9] | 15 (27.3 %) | 28 (26.2 %) | 69 (28.6 %) | 42 (16.4 %) | 38 (15.2 %) | 75 (19.8 %) | 3 (23.1 %) | 270 (20.8 %) |
| (34.9,Inf] | 5 (9.1 %) | 14 (13.1 %) | 80 (33.2 %) | 133 (52.0 %) | 122 (48.8 %) | 166 (43.8 %) | 5 (38.5 %) | 525 (40.4 %) |
| hs_bakery_prod_Ter | ||||||||
| (0,2] | 16 (29.1 %) | 21 (19.6 %) | 88 (36.5 %) | 117 (45.7 %) | 48 (19.2 %) | 53 (14.0 %) | 2 (15.4 %) | 345 (26.5 %) |
| (2,6] | 12 (21.8 %) | 30 (28.0 %) | 75 (31.1 %) | 90 (35.2 %) | 77 (30.8 %) | 132 (34.8 %) | 7 (53.8 %) | 423 (32.5 %) |
| (6,Inf] | 27 (49.1 %) | 56 (52.3 %) | 78 (32.4 %) | 49 (19.1 %) | 125 (50.0 %) | 194 (51.2 %) | 4 (30.8 %) | 533 (41.0 %) |
| hs_break_cer_Ter | ||||||||
| (0,1.1] | 16 (29.1 %) | 38 (35.5 %) | 60 (24.9 %) | 62 (24.2 %) | 39 (15.6 %) | 70 (18.5 %) | 6 (46.2 %) | 291 (22.4 %) |
| (1.1,5.5] | 19 (34.5 %) | 36 (33.6 %) | 103 (42.7 %) | 103 (40.2 %) | 103 (41.2 %) | 153 (40.4 %) | 4 (30.8 %) | 521 (40.0 %) |
| (5.5,Inf] | 20 (36.4 %) | 33 (30.8 %) | 78 (32.4 %) | 91 (35.5 %) | 108 (43.2 %) | 156 (41.2 %) | 3 (23.1 %) | 489 (37.6 %) |
| hs_dairy_Ter | ||||||||
| (0,14.6] | 15 (27.3 %) | 21 (19.6 %) | 67 (27.8 %) | 58 (22.7 %) | 73 (29.2 %) | 119 (31.4 %) | 6 (46.2 %) | 359 (27.6 %) |
| (14.6,25.6] | 12 (21.8 %) | 27 (25.2 %) | 90 (37.3 %) | 101 (39.5 %) | 80 (32.0 %) | 149 (39.3 %) | 6 (46.2 %) | 465 (35.7 %) |
| (25.6,Inf] | 28 (50.9 %) | 59 (55.1 %) | 84 (34.9 %) | 97 (37.9 %) | 97 (38.8 %) | 111 (29.3 %) | 1 (7.7 %) | 477 (36.7 %) |
| hs_fastfood_Ter | ||||||||
| (0,0.132] | 6 (10.9 %) | 14 (13.1 %) | 20 (8.3 %) | 21 (8.2 %) | 22 (8.8 %) | 57 (15.0 %) | 3 (23.1 %) | 143 (11.0 %) |
| (0.132,0.5] | 38 (69.1 %) | 45 (42.1 %) | 140 (58.1 %) | 153 (59.8 %) | 94 (37.6 %) | 129 (34.0 %) | 4 (30.8 %) | 603 (46.3 %) |
| (0.5,Inf] | 11 (20.0 %) | 48 (44.9 %) | 81 (33.6 %) | 82 (32.0 %) | 134 (53.6 %) | 193 (50.9 %) | 6 (46.2 %) | 555 (42.7 %) |
| hs_org_food_Ter | ||||||||
| (0,0.132] | 17 (30.9 %) | 30 (28.0 %) | 77 (32.0 %) | 51 (19.9 %) | 96 (38.4 %) | 155 (40.9 %) | 3 (23.1 %) | 429 (33.0 %) |
| (0.132,1] | 20 (36.4 %) | 39 (36.4 %) | 78 (32.4 %) | 99 (38.7 %) | 68 (27.2 %) | 87 (23.0 %) | 5 (38.5 %) | 396 (30.4 %) |
| (1,Inf] | 18 (32.7 %) | 38 (35.5 %) | 86 (35.7 %) | 106 (41.4 %) | 86 (34.4 %) | 137 (36.1 %) | 5 (38.5 %) | 476 (36.6 %) |
| hs_proc_meat_Ter | ||||||||
| (0,1.5] | 6 (10.9 %) | 26 (24.3 %) | 38 (15.8 %) | 37 (14.5 %) | 91 (36.4 %) | 162 (42.7 %) | 6 (46.2 %) | 366 (28.1 %) |
| (1.5,4] | 36 (65.5 %) | 41 (38.3 %) | 80 (33.2 %) | 92 (35.9 %) | 83 (33.2 %) | 134 (35.4 %) | 5 (38.5 %) | 471 (36.2 %) |
| (4,Inf] | 13 (23.6 %) | 40 (37.4 %) | 123 (51.0 %) | 127 (49.6 %) | 76 (30.4 %) | 83 (21.9 %) | 2 (15.4 %) | 464 (35.7 %) |
| hs_total_fish_Ter | ||||||||
| (0,1.5] | 11 (20.0 %) | 20 (18.7 %) | 32 (13.3 %) | 36 (14.1 %) | 106 (42.4 %) | 180 (47.5 %) | 4 (30.8 %) | 389 (29.9 %) |
| (1.5,3] | 31 (56.4 %) | 56 (52.3 %) | 80 (33.2 %) | 70 (27.3 %) | 82 (32.8 %) | 128 (33.8 %) | 7 (53.8 %) | 454 (34.9 %) |
| (3,Inf] | 13 (23.6 %) | 31 (29.0 %) | 129 (53.5 %) | 150 (58.6 %) | 62 (24.8 %) | 71 (18.7 %) | 2 (15.4 %) | 458 (35.2 %) |
| hs_total_fruits_Ter | ||||||||
| (0,7] | 33 (60.0 %) | 58 (54.2 %) | 73 (30.3 %) | 55 (21.5 %) | 68 (27.2 %) | 119 (31.4 %) | 7 (53.8 %) | 413 (31.7 %) |
| (7,14.1] | 13 (23.6 %) | 24 (22.4 %) | 88 (36.5 %) | 76 (29.7 %) | 81 (32.4 %) | 123 (32.5 %) | 2 (15.4 %) | 407 (31.3 %) |
| (14.1,Inf] | 9 (16.4 %) | 25 (23.4 %) | 80 (33.2 %) | 125 (48.8 %) | 101 (40.4 %) | 137 (36.1 %) | 4 (30.8 %) | 481 (37.0 %) |
| hs_total_lipids_Ter | ||||||||
| (0,3] | 9 (16.4 %) | 18 (16.8 %) | 97 (40.2 %) | 82 (32.0 %) | 67 (26.8 %) | 122 (32.2 %) | 2 (15.4 %) | 397 (30.5 %) |
| (3,7] | 26 (47.3 %) | 45 (42.1 %) | 71 (29.5 %) | 59 (23.0 %) | 80 (32.0 %) | 116 (30.6 %) | 6 (46.2 %) | 403 (31.0 %) |
| (7,Inf] | 20 (36.4 %) | 44 (41.1 %) | 73 (30.3 %) | 115 (44.9 %) | 103 (41.2 %) | 141 (37.2 %) | 5 (38.5 %) | 501 (38.5 %) |
| hs_total_sweets_Ter | ||||||||
| (0,4.1] | 16 (29.1 %) | 17 (15.9 %) | 87 (36.1 %) | 96 (37.5 %) | 51 (20.4 %) | 74 (19.5 %) | 3 (23.1 %) | 344 (26.4 %) |
| (4.1,8.5] | 13 (23.6 %) | 38 (35.5 %) | 99 (41.1 %) | 103 (40.2 %) | 104 (41.6 %) | 157 (41.4 %) | 2 (15.4 %) | 516 (39.7 %) |
| (8.5,Inf] | 26 (47.3 %) | 52 (48.6 %) | 55 (22.8 %) | 57 (22.3 %) | 95 (38.0 %) | 148 (39.1 %) | 8 (61.5 %) | 441 (33.9 %) |
| hs_total_veg_Ter | ||||||||
| (0,6] | 19 (34.5 %) | 26 (24.3 %) | 75 (31.1 %) | 63 (24.6 %) | 81 (32.4 %) | 136 (35.9 %) | 4 (30.8 %) | 404 (31.1 %) |
| (6,8.5] | 11 (20.0 %) | 25 (23.4 %) | 53 (22.0 %) | 71 (27.7 %) | 55 (22.0 %) | 95 (25.1 %) | 4 (30.8 %) | 314 (24.1 %) |
| (8.5,Inf] | 25 (45.5 %) | 56 (52.3 %) | 113 (46.9 %) | 122 (47.7 %) | 114 (45.6 %) | 148 (39.1 %) | 5 (38.5 %) | 583 (44.8 %) |
| hs_cd_c_Log2 | -4.32 (1.45) | -3.93 (1.01) | -3.90 (1.15) | -3.91 (1.00) | -3.90 (0.88) | -4.05 (0.97) | -4.09 (1.85) | -3.97 (1.04) |
| hs_co_c_Log2 | -2.35 (0.51) | -2.38 (0.59) | -2.54 (0.60) | -2.45 (0.68) | -2.27 (0.58) | -2.20 (0.62) | -2.11 (0.55) | -2.34 (0.63) |
| hs_cs_c_Log2 | 1.04 (0.47) | 1.03 (0.50) | 0.71 (0.43) | 0.65 (0.43) | 0.18 (0.50) | 0.07 (0.45) | -0.04 (0.40) | 0.44 (0.57) |
| hs_cu_c_Log2 | 9.83 (0.18) | 9.89 (0.30) | 9.79 (0.21) | 9.76 (0.22) | 9.82 (0.22) | 9.88 (0.23) | 9.86 (0.26) | 9.83 (0.23) |
| hs_hg_c_Log2 | 0.56 (1.21) | 0.73 (1.27) | 0.41 (1.35) | 0.12 (1.35) | -0.78 (1.73) | -1.11 (1.70) | -0.75 (1.41) | -0.30 (1.68) |
| hs_mo_c_Log2 | -0.85 (1.80) | -0.43 (0.64) | -0.45 (0.83) | -0.32 (0.81) | -0.26 (0.83) | -0.16 (0.88) | -0.27 (0.94) | -0.32 (0.90) |
| hs_dde_cadj_Log2 | 4.06 (1.34) | 3.90 (1.09) | 4.23 (1.24) | 4.32 (1.02) | 4.96 (1.66) | 5.28 (1.62) | 5.16 (1.23) | 4.67 (1.49) |
| hs_pcb153_cadj_Log2 | 3.54 (0.73) | 3.47 (0.72) | 3.80 (0.84) | 4.03 (0.79) | 3.32 (0.95) | 3.25 (0.87) | 3.69 (0.96) | 3.56 (0.90) |
| hs_pcb170_cadj_Log2 | 0.48 (1.19) | 0.17 (2.28) | 0.64 (2.35) | 1.08 (1.77) | -1.46 (3.58) | -1.32 (3.29) | -0.92 (3.65) | -0.31 (3.00) |
| hs_dep_cadj_Log2 | -0.62 (3.08) | 0.10 (3.10) | 0.13 (3.31) | 0.19 (2.90) | 0.65 (3.10) | -0.02 (3.42) | -0.25 (3.57) | 0.16 (3.21) |
| hs_pbde153_cadj_Log2 | -4.43 (3.35) | -5.26 (3.63) | -4.26 (3.77) | -3.68 (3.60) | -4.30 (3.95) | -5.20 (3.93) | -5.08 (3.90) | -4.53 (3.83) |
| hs_pfhxs_c_Log2 | -0.58 (0.66) | -0.53 (0.85) | -1.01 (0.99) | -1.07 (0.88) | -2.03 (1.40) | -2.37 (1.22) | -2.68 (0.71) | -1.57 (1.31) |
| hs_pfoa_c_Log2 | 0.53 (0.44) | 0.55 (0.53) | 0.67 (0.49) | 0.64 (0.51) | 0.66 (0.63) | 0.55 (0.58) | 0.35 (0.50) | 0.61 (0.55) |
| hs_pfos_c_Log2 | 1.67 (0.68) | 1.55 (0.72) | 1.18 (1.00) | 1.10 (1.14) | 0.80 (1.05) | 0.62 (1.17) | 0.40 (0.94) | 0.97 (1.11) |
| hs_prpa_cadj_Log2 | -3.21 (3.68) | -2.67 (3.04) | -0.92 (3.92) | -1.65 (3.89) | -1.57 (3.80) | -1.58 (3.80) | 0.87 (4.50) | -1.61 (3.82) |
| hs_mbzp_cadj_Log2 | 2.69 (1.08) | 2.84 (1.11) | 2.43 (1.19) | 2.32 (1.14) | 2.36 (1.34) | 2.45 (1.24) | 2.21 (1.51) | 2.44 (1.22) |
| hs_mibp_cadj_Log2 | 5.19 (1.08) | 5.59 (1.07) | 4.84 (0.97) | 4.82 (0.98) | 5.84 (1.00) | 6.02 (0.93) | 6.27 (0.97) | 5.46 (1.11) |
| hs_mnbp_cadj_Log2 | 3.99 (0.86) | 4.34 (0.88) | 4.26 (0.90) | 4.52 (0.90) | 4.85 (0.95) | 5.11 (1.05) | 5.22 (0.77) | 4.68 (1.02) |
For variable selection, elastic net was applied to the available diet and chemical variables in the HELIX data; the variables it retained were carried forward for further analysis.
outcome_cov <- cbind(covariate_data, outcome_BMI)
outcome_cov <- outcome_cov[, !duplicated(colnames(outcome_cov))]
#the full chemicals list
chemicals_full <- c(
"hs_as_c_Log2",
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mn_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_tl_cdich_None",
"hs_dde_cadj_Log2",
"hs_ddt_cadj_Log2",
"hs_hcb_cadj_Log2",
"hs_pcb118_cadj_Log2",
"hs_pcb138_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_pcb180_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_detp_cadj_Log2",
"hs_dmdtp_cdich_None",
"hs_dmp_cadj_Log2",
"hs_dmtp_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pbde47_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfna_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_pfunda_c_Log2",
"hs_bpa_cadj_Log2",
"hs_bupa_cadj_Log2",
"hs_etpa_cadj_Log2",
"hs_mepa_cadj_Log2",
"hs_oxbe_cadj_Log2",
"hs_prpa_cadj_Log2",
"hs_trcs_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mecpp_cadj_Log2",
"hs_mehhp_cadj_Log2",
"hs_mehp_cadj_Log2",
"hs_meohp_cadj_Log2",
"hs_mep_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2",
"hs_ohminp_cadj_Log2",
"hs_oxominp_cadj_Log2",
"hs_cotinine_cdich_None",
"hs_globalexp2_None"
)
#postnatal diet for child
postnatal_diet <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_beverages_Ter",
"hs_break_cer_Ter",
"hs_caff_drink_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_readymade_Ter",
"hs_total_bread_Ter",
"hs_total_cereal_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_meat_Ter",
"hs_total_potatoes_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter",
"hs_total_yog_Ter"
)
chemicals_columns <- c(chemicals_full)
all_chemicals <- exposome %>% dplyr::select(all_of(chemicals_columns))
diet_columns <- c(postnatal_diet)
all_diet <- exposome %>% dplyr::select(all_of(diet_columns))
all_columns <- c(chemicals_full, postnatal_diet)
extracted_exposome <- exposome %>% dplyr::select(all_of(all_columns))
chemicals_outcome_cov <- cbind(outcome_cov, all_chemicals)
diet_outcome_cov <- cbind(outcome_cov, all_diet)
interested_data <- cbind(outcome_cov, extracted_exposome)
head(interested_data)
The chemical exposures are screened first, using elastic net to identify the most predictive variables.
# train/test 70-30
set.seed(101)
train_indices <- sample(seq_len(nrow(chemicals_outcome_cov)), size = floor(0.7 * nrow(chemicals_outcome_cov)))
test_indices <- setdiff(seq_len(nrow(chemicals_outcome_cov)), train_indices)
x_train <- as.matrix(chemicals_outcome_cov[train_indices, setdiff(names(chemicals_outcome_cov), "hs_zbmi_who")])
y_train <- chemicals_outcome_cov$hs_zbmi_who[train_indices]
x_test <- as.matrix(chemicals_outcome_cov[test_indices, setdiff(names(chemicals_outcome_cov), "hs_zbmi_who")])
y_test <- chemicals_outcome_cov$hs_zbmi_who[test_indices]
x_train_chemicals_only <- as.matrix(chemicals_outcome_cov[train_indices, chemicals_full])
x_test_chemicals_only <- as.matrix(chemicals_outcome_cov[test_indices, chemicals_full])
# ELASTIC NET
fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 0.5, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)
test_rmse_without_covariates <- sqrt(test_mse_without_covariates)
plot(fit_without_covariates_train$glmnet.fit, xvar = "lambda", main = "Coefficient Path (Without Covariates)")
best_lambda <- fit_without_covariates_train$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 50 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -4.785950188
## hs_as_c_Log2 .
## hs_cd_c_Log2 -0.025843356
## hs_co_c_Log2 -0.005835867
## hs_cs_c_Log2 0.084715330
## hs_cu_c_Log2 0.607379616
## hs_hg_c_Log2 -0.009800093
## hs_mn_c_Log2 .
## hs_mo_c_Log2 -0.099724922
## hs_pb_c_Log2 -0.010318890
## hs_tl_cdich_None .
## hs_dde_cadj_Log2 -0.039528137
## hs_ddt_cadj_Log2 .
## hs_hcb_cadj_Log2 .
## hs_pcb118_cadj_Log2 .
## hs_pcb138_cadj_Log2 .
## hs_pcb153_cadj_Log2 -0.169008355
## hs_pcb170_cadj_Log2 -0.055808065
## hs_pcb180_cadj_Log2 .
## hs_dep_cadj_Log2 -0.019034348
## hs_detp_cadj_Log2 .
## hs_dmdtp_cdich_None .
## hs_dmp_cadj_Log2 .
## hs_dmtp_cadj_Log2 .
## hs_pbde153_cadj_Log2 -0.035464586
## hs_pbde47_cadj_Log2 .
## hs_pfhxs_c_Log2 -0.006816020
## hs_pfna_c_Log2 .
## hs_pfoa_c_Log2 -0.135997766
## hs_pfos_c_Log2 -0.047692264
## hs_pfunda_c_Log2 .
## hs_bpa_cadj_Log2 .
## hs_bupa_cadj_Log2 .
## hs_etpa_cadj_Log2 .
## hs_mepa_cadj_Log2 .
## hs_oxbe_cadj_Log2 0.002529961
## hs_prpa_cadj_Log2 0.001735800
## hs_trcs_cadj_Log2 .
## hs_mbzp_cadj_Log2 0.040317847
## hs_mecpp_cadj_Log2 .
## hs_mehhp_cadj_Log2 .
## hs_mehp_cadj_Log2 .
## hs_meohp_cadj_Log2 .
## hs_mep_cadj_Log2 .
## hs_mibp_cadj_Log2 -0.047892677
## hs_mnbp_cadj_Log2 -0.008483913
## hs_ohminp_cadj_Log2 .
## hs_oxominp_cadj_Log2 .
## hs_cotinine_cdich_None .
## hs_globalexp2_None .
cat("Model without Covariates - Test RMSE:", test_rmse_without_covariates)
## Model without Covariates - Test RMSE: 1.108515
# chemicals retained after the elastic net screening
chemicals_selected <- c(
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_detp_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_mepa_cadj_Log2",
"hs_oxbe_cadj_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2")
The chemical features were selected from the elastic net fit. LASSO alone can discard too many correlated predictors when reducing dimensionality, so elastic net was chosen given the uncertainty about feature importance.
As with the chemical variables, the children’s postnatal diet variables were screened for the most predictive features using the same regression methods.
# train/test
set.seed(101)
train_indices <- sample(seq_len(nrow(diet_outcome_cov)), size = floor(0.7 * nrow(diet_outcome_cov)))
test_indices <- setdiff(seq_len(nrow(diet_outcome_cov)), train_indices)
diet_data <- diet_outcome_cov[, postnatal_diet]
x_diet_train <- model.matrix(~ . + 0, data = diet_data[train_indices, ])
x_diet_test <- model.matrix(~ . + 0, data = diet_data[test_indices, ])
covariates <- diet_outcome_cov[, c("e3_sex_None", "e3_yearbir_None", "hs_child_age_None")]
x_covariates_train <- model.matrix(~ . + 0, data = covariates[train_indices, ])
x_covariates_test <- model.matrix(~ . + 0, data = covariates[test_indices, ])
x_full_train <- cbind(x_diet_train, x_covariates_train)
x_full_test <- cbind(x_diet_test, x_covariates_test)
x_full_train[is.na(x_full_train)] <- 0
x_full_test[is.na(x_full_test)] <- 0
x_diet_train[is.na(x_diet_train)] <- 0
x_diet_test[is.na(x_diet_test)] <- 0
y_train <- as.numeric(diet_outcome_cov$hs_zbmi_who[train_indices])
y_test <- as.numeric(diet_outcome_cov$hs_zbmi_who[test_indices])
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 0.5, family = "gaussian")
fit_without_covariates
##
## Call: cv.glmnet(x = x_diet_train, y = y_train, alpha = 0.5, family = "gaussian")
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.1384 9 1.431 0.06031 5
## 1se 0.2914 1 1.442 0.06168 0
plot(fit_without_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (Without Covariates)")
best_lambda <- fit_without_covariates$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 41 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.52746936
## h_bfdur_Ter(0,10.8] .
## h_bfdur_Ter(10.8,34.9] .
## h_bfdur_Ter(34.9,Inf] .
## hs_bakery_prod_Ter(2,6] .
## hs_bakery_prod_Ter(6,Inf] .
## hs_beverages_Ter(0.132,1] .
## hs_beverages_Ter(1,Inf] .
## hs_break_cer_Ter(1.1,5.5] .
## hs_break_cer_Ter(5.5,Inf] .
## hs_caff_drink_Ter(0.132,Inf] .
## hs_dairy_Ter(14.6,25.6] .
## hs_dairy_Ter(25.6,Inf] .
## hs_fastfood_Ter(0.132,0.5] .
## hs_fastfood_Ter(0.5,Inf] .
## hs_org_food_Ter(0.132,1] .
## hs_org_food_Ter(1,Inf] -0.12911952
## hs_proc_meat_Ter(1.5,4] .
## hs_proc_meat_Ter(4,Inf] .
## hs_readymade_Ter(0.132,0.5] .
## hs_readymade_Ter(0.5,Inf] .
## hs_total_bread_Ter(7,17.5] .
## hs_total_bread_Ter(17.5,Inf] .
## hs_total_cereal_Ter(14.1,23.6] .
## hs_total_cereal_Ter(23.6,Inf] .
## hs_total_fish_Ter(1.5,3] .
## hs_total_fish_Ter(3,Inf] .
## hs_total_fruits_Ter(7,14.1] .
## hs_total_fruits_Ter(14.1,Inf] -0.02484832
## hs_total_lipids_Ter(3,7] .
## hs_total_lipids_Ter(7,Inf] -0.05017966
## hs_total_meat_Ter(6,9] .
## hs_total_meat_Ter(9,Inf] .
## hs_total_potatoes_Ter(3,4] .
## hs_total_potatoes_Ter(4,Inf] .
## hs_total_sweets_Ter(4.1,8.5] -0.01453272
## hs_total_sweets_Ter(8.5,Inf] .
## hs_total_veg_Ter(6,8.5] .
## hs_total_veg_Ter(8.5,Inf] -0.07850581
## hs_total_yog_Ter(6,8.5] .
## hs_total_yog_Ter(8.5,Inf] .
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)
rmse_without_covariates <- sqrt(mse_without_covariates)
cat("Model without Covariates - Test RMSE:", rmse_without_covariates)
## Model without Covariates - Test RMSE: 1.161644
# diet variables retained after the elastic net screening
diet_selected <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
To analyze the impact of dietary, chemical, and demographic variables on BMI Z-scores, several statistical models were employed, each chosen for its strengths in handling different aspects of the data. All models were trained and evaluated using a 70/30 train-test split.
LASSO (Least Absolute Shrinkage and Selection Operator) applies L1 regularization, penalizing the sum of the absolute values of the coefficients. It performs variable selection and regularization simultaneously, identifying the most significant predictors while shrinking less important ones to exactly zero.
Ridge regression is particularly useful for handling multicollinearity among predictors, ensuring that all variables contribute to the prediction without being overly penalized.
Elastic net balances the benefits of both LASSO and ridge; by handling variable selection and multicollinearity together, it is well-suited for high-dimensional datasets with correlated predictors.
Because predictors become correlated when the datasets are combined (diet, chemicals, and serum metabolomics), group LASSO, group ridge, and group elastic net were also applied.
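No group-model code appears in this section, so the following is only a rough sketch of how a group lasso fit could look using the `gglasso` package; the group index `grp` and the reuse of the `x_train`/`y_train` matrices defined later are illustrative assumptions, not the analysis code itself.

```r
# Sketch only: assumes the gglasso package is installed and that the columns
# of x_train are ordered so the dummy columns of each original variable sit
# together; grp assigns one consecutive group id per original variable.
library(gglasso)

grp <- rep(seq_len(ncol(x_train) / 2), each = 2)  # e.g., two dummies per tertile-coded variable

set.seed(101)
cv_grp <- cv.gglasso(x_train, y_train, group = grp, loss = "ls",
                     pred.loss = "L2", nfolds = 10)
grp_predictions <- predict(cv_grp, newx = x_test, s = "lambda.min")
cat("Group Lasso RMSE:", sqrt(mean((y_test - grp_predictions)^2)), "\n")
```

The `grpreg` package would be an alternative here; either way the point is that an entire variable's set of dummy columns enters or leaves the model together.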
Decision tree (with pruning) splits the data into subsets based on the values of the input features. It provides an interpretable structure for understanding the relationships between variables and the outcome, but it is prone to overfitting, so pruning was applied to improve generalizability.
Random forest constructs multiple decision trees and merges them to obtain a more accurate and stable prediction. By averaging the predictions from numerous trees, this model reduces overfitting and captures complex interactions among variables.
Gradient Boosting Machine (GBM) builds an ensemble of trees in a sequential manner, where each tree corrects the errors of its predecessors. This approach is highly effective in improving predictive accuracy by focusing on the residuals of previous trees, making it powerful for capturing non-linear relationships.
These ensemble methods (random forest and GBM) were chosen for their robustness and high predictive accuracy. Random forest reduces variance by averaging multiple trees, while GBM improves model performance through iterative refinement.
In this study, the Root Mean Squared Error (RMSE) is used as the primary performance metric to evaluate and compare the predictive models. Using RMSE allows for straightforward comparisons between different models and datasets. Since RMSE is in the same units as the outcome variable, it facilitates direct assessment of how well different models perform in predicting BMI Z-scores, making it easier to identify the model that provides the most accurate predictions.
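Since RMSE is computed repeatedly below, the calculation can be wrapped in a small helper; a minimal sketch (the `rmse` function name is ours, not part of the analysis code):

```r
# Root Mean Squared Error: average prediction error in the outcome's own units
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(c(1, 2, 3), c(1.5, 2, 2.5))  # sqrt(mean(c(0.25, 0, 0.25))) ≈ 0.408
```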
By employing a diverse set of models, this analysis aims to identify the most significant predictors of BMI Z-scores and understand the complex interactions between dietary intake, chemical exposures, and metabolomic serum data. The combination of regularization techniques and ensemble methods ensures a comprehensive and reliable assessment of the data.
#selected chemicals that were noted in enet
chemicals_selected <- c(
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_detp_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_mepa_cadj_Log2",
"hs_oxbe_cadj_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2")
combined_data_selected <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter",
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
outcome_cov <- cbind(covariate_data, outcome_BMI)
outcome_cov <- outcome_cov[, !duplicated(colnames(outcome_cov))]
finalized_columns <- c(combined_data_selected)
final_selected_data <- exposome %>% dplyr::select(all_of(finalized_columns))
finalized_data <- cbind(outcome_cov, final_selected_data)
head(finalized_data)
numeric_finalized <- finalized_data %>%
dplyr::select(where(is.numeric))
cor_matrix <- cor(numeric_finalized, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 90, tl.cex = 0.6)
find_highly_correlated <- function(cor_matrix, threshold = 0.8) {
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA
cor_matrix <- as.data.frame(as.table(cor_matrix))
cor_matrix <- na.omit(cor_matrix)
cor_matrix <- cor_matrix[order(-abs(cor_matrix$Freq)), ]
cor_matrix <- cor_matrix %>% filter(abs(Freq) > threshold)
return(cor_matrix)
}
highly_correlated_pairs <- find_highly_correlated(cor_matrix, threshold = 0.50)
highly_correlated_pairs
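A similar screen could be done with caret's `findCorrelation()`, which returns the columns whose removal would reduce the large pairwise correlations; a sketch assuming `cor_matrix` and `numeric_finalized` exist as above:

```r
# caret-based alternative to the custom helper above: indices of columns
# flagged for removal at the same 0.50 cutoff
library(caret)
high_cor_idx <- findCorrelation(cor_matrix, cutoff = 0.50)
colnames(numeric_finalized)[high_cor_idx]
```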
The correlation plot for the selected variables indicates notable multicollinearity among several chemical variables and the child age covariate. Grouped regression models such as group LASSO, ridge, and elastic net handle these highly correlated variables collectively, which helps mitigate overfitting.
covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
baseline_data <- finalized_data %>% dplyr::select(c(covariates_selected, "hs_zbmi_who"))
x <- model.matrix(~ . -1, data = baseline_data[ , !names(baseline_data) %in% "hs_zbmi_who"])
y <- baseline_data$hs_zbmi_who
train_control <- trainControl(method = "cv", number = 10)
set.seed(101)
trainIndex <- createDataPartition(y, p = .7, list = FALSE, times = 1)
x_train <- x[trainIndex, ]
y_train <- y[trainIndex]
x_test <- x[-trainIndex, ]
y_test <- y[-trainIndex]
penalty_factors <- rep(1, ncol(x_train))
# note: matching is on exact column names, so only the numeric covariate
# (hs_child_age_None) is left unpenalized; dummy-coded factor columns
# (e.g., e3_sex_Nonefemale) do not match covariates_selected and keep penalty 1
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
set.seed(101)
fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)
coef(fit_lasso)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.87326347
## hs_child_age_None -0.05738585
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)
cat("Baseline Lasso RMSE:", rmse_lasso, "\n")
## Baseline Lasso RMSE: 1.152017
lasso_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
lasso_predictions_cv <- predict(lasso_model, x_test)
mse_lasso_cv <- mean((y_test - lasso_predictions_cv)^2)
rmse_lasso_cv <- sqrt(mse_lasso_cv)
cat("10-Fold CV Lasso RMSE:", rmse_lasso_cv)
## 10-Fold CV Lasso RMSE: 1.152017
As λ increases, the penalty for the coefficients becomes stronger, driving some coefficients to zero. The optimal λ (lambda.min) is chosen as the one that minimizes the cross-validated MSE. The coefficients listed show the non-zero variables selected by LASSO. In this case, children’s age is selected as significant.
set.seed(101)
fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)
coef(fit_ridge)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 8.732635e-01
## hs_child_age_None -5.738585e-02
## e3_sex_Nonefemale -3.110211e-38
## e3_sex_Nonemale 3.110211e-38
## e3_yearbir_None2004 -1.463782e-37
## e3_yearbir_None2005 2.704907e-38
## e3_yearbir_None2006 -9.332751e-39
## e3_yearbir_None2007 -5.445405e-38
## e3_yearbir_None2008 3.129795e-38
## e3_yearbir_None2009 3.620840e-37
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)
cat("Baseline Ridge RMSE:", rmse_ridge, "\n")
## Baseline Ridge RMSE: 1.152017
ridge_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
ridge_predictions_cv <- predict(ridge_model, x_test)
mse_ridge_cv <- mean((y_test - ridge_predictions_cv)^2)
rmse_ridge_cv <- sqrt(mse_ridge_cv)
cat("10-Fold CV Ridge RMSE:", rmse_ridge_cv, "\n")
## 10-Fold CV Ridge RMSE: 1.149128
The coefficients show that age has the most significant effect, similar to LASSO, with other variables having very small coefficients. The λ was optimized using the same cross-validation approach as LASSO. Ridge regression performs similarly to LASSO in terms of RMSE.
set.seed(101)
fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)
coef(fit_enet)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.87326347
## hs_child_age_None -0.05738585
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)
cat("Baseline Elastic Net RMSE:", rmse_enet, "\n")
## Baseline Elastic Net RMSE: 1.152017
enet_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = seq(0, 1, length = 10), lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
enet_predictions_cv <- predict(enet_model, x_test)
mse_enet_cv <- mean((y_test - enet_predictions_cv)^2)
rmse_enet_cv <- sqrt(mse_enet_cv)
cat("10-Fold CV Elastic Net RMSE:", rmse_enet_cv, "\n")
## 10-Fold CV Elastic Net RMSE: 1.152017
Elastic Net uses cross-validation to find the best balance between LASSO and Ridge penalties. The Elastic Net RMSE is the same as LASSO, indicating similar predictive performance.
set.seed(101)
trainIndex <- createDataPartition(baseline_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- baseline_data[trainIndex, ]
test_data <- baseline_data[-trainIndex, ]
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
printcp(fit_tree)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] hs_child_age_None
##
## Root node error: 1328.8/913 = 1.4554
##
## n= 913
##
## CP nsplit rel error xerror xstd
## 1 0.011581 0 1.00000 1.0026 0.051032
## 2 0.010000 1 0.98842 1.0130 0.051854
plotcp(fit_tree)
# getting the optimal cp value that minimizes the cross-validation error
optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]
cat("Optimal cp value:", optimal_cp, "\n")
## Optimal cp value: 0.01158106
pruned_tree <- prune(fit_tree, cp = optimal_cp)
rpart.plot(pruned_tree)
pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Baseline Pruned Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Baseline Pruned Decision Tree RMSE: 1.148665
set.seed(101)
train_control <- trainControl(method = "cv", number = 10)
cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.001))
pruned_tree_model <- train(
hs_zbmi_who ~ .,
data = train_data,
method = "rpart",
trControl = train_control,
tuneGrid = cp_grid
)
best_cp <- pruned_tree_model$bestTune$cp
cat("Best cp value from cross-validation:", best_cp, "\n")
## Best cp value from cross-validation: 0.007
fit_tree_unpruned <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
fit_tree_best <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova", control = rpart.control(cp = best_cp))
rpart.plot(fit_tree_unpruned)
rpart.plot(fit_tree_best)
unpruned_tree_predictions <- predict(fit_tree_unpruned, newdata = test_data)
pruned_tree_predictions <- predict(fit_tree_best, newdata = test_data)
mse_unpruned_tree <- mean((test_data$hs_zbmi_who - unpruned_tree_predictions)^2)
rmse_unpruned_tree <- sqrt(mse_unpruned_tree)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Unpruned Decision Tree RMSE:", rmse_unpruned_tree, "\n")
## Unpruned Decision Tree RMSE: 1.155427
cat("Pruned Decision Tree RMSE with Best cp:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE with Best cp: 1.155427
The decision tree plot shows a simple model with age as the primary split. The complexity parameter (cp) controls tree pruning, and the optimal cp was chosen by cross-validation as the value minimizing the cross-validation error. The RMSE values show that pruning improved the test error slightly in the baseline fit, though the tree remained roughly on par with the regularization methods.
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500, importance = TRUE)
varImpPlot(fit_rf)
rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((test_data$hs_zbmi_who - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)
cat("Baseline Random Forest RMSE:", rmse_rf, "\n")
## Baseline Random Forest RMSE: 1.157731
rf_model <- train(
x_train, y_train,
method = "rf",
trControl = train_control,
tuneLength = 10
)
## note: only 8 unique complexity parameters in default grid. Truncating the grid to 8 .
rf_predictions_cv <- predict(rf_model, x_test)
mse_rf_cv <- mean((y_test - rf_predictions_cv)^2)
rmse_rf_cv <- sqrt(mse_rf_cv)
cat("10-Fold CV Random Forest RMSE:", rmse_rf_cv, "\n")
## 10-Fold CV Random Forest RMSE: 1.154789
Age is the most important variable, followed by year of birth. Random Forest performs similarly to other models in terms of RMSE.
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5, verbose = FALSE)
summary(fit_gbm)
best_trees <- gbm.perf(fit_gbm, method = "cv")
gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_trees)
mse_gbm <- mean((test_data$hs_zbmi_who - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Baseline GBM RMSE:", rmse_gbm, "\n")
## Baseline GBM RMSE: 1.14888
gbm_model <- train(
x_train, y_train,
method = "gbm",
trControl = train_control,
tuneLength = 10,
verbose = FALSE
)
gbm_predictions_cv <- predict(gbm_model, x_test)
mse_gbm_cv <- mean((y_test - gbm_predictions_cv)^2)
rmse_gbm_cv <- sqrt(mse_gbm_cv)
cat("10-Fold CV GBM RMSE:", rmse_gbm_cv, "\n")
## 10-Fold CV GBM RMSE: 1.149695
Age has the highest relative influence. GBM shows slightly better performance compared to some other models.
LASSO, Ridge, and Elastic Net: Provide similar predictive performance, with all models effectively handling multicollinearity.
Decision Tree: Simple and interpretable, but less effective than regularization methods.
Random Forest and GBM: Offer competitive performance, with GBM slightly better in RMSE.
diet_selected <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
diet_data <- finalized_data %>% dplyr::select(c(covariates_selected, diet_selected, "hs_zbmi_who"))
set.seed(101)
trainIndex <- createDataPartition(diet_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- diet_data[trainIndex, ]
test_data <- diet_data[-trainIndex, ]
train_control <- trainControl(method = "cv", number = 10)
x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)
coef(fit_lasso)
## 34 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.87326347
## hs_child_age_None -0.05738585
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## h_bfdur_Ter(10.8,34.9] .
## h_bfdur_Ter(34.9,Inf] .
## hs_bakery_prod_Ter(2,6] .
## hs_bakery_prod_Ter(6,Inf] .
## hs_break_cer_Ter(1.1,5.5] .
## hs_break_cer_Ter(5.5,Inf] .
## hs_dairy_Ter(14.6,25.6] .
## hs_dairy_Ter(25.6,Inf] .
## hs_fastfood_Ter(0.132,0.5] .
## hs_fastfood_Ter(0.5,Inf] .
## hs_org_food_Ter(0.132,1] .
## hs_org_food_Ter(1,Inf] .
## hs_proc_meat_Ter(1.5,4] .
## hs_proc_meat_Ter(4,Inf] .
## hs_total_fish_Ter(1.5,3] .
## hs_total_fish_Ter(3,Inf] .
## hs_total_fruits_Ter(7,14.1] .
## hs_total_fruits_Ter(14.1,Inf] .
## hs_total_lipids_Ter(3,7] .
## hs_total_lipids_Ter(7,Inf] .
## hs_total_sweets_Ter(4.1,8.5] .
## hs_total_sweets_Ter(8.5,Inf] .
## hs_total_veg_Ter(6,8.5] .
## hs_total_veg_Ter(8.5,Inf] .
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)
cat("Diet + Covariates Lasso RMSE:", rmse_lasso, "\n")
## Diet + Covariates Lasso RMSE: 1.138756
lasso_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
lasso_predictions_cv <- predict(lasso_model, x_test)
mse_lasso_cv <- mean((y_test - lasso_predictions_cv)^2)
rmse_lasso_cv <- sqrt(mse_lasso_cv)
cat("10-Fold CV Lasso RMSE:", rmse_lasso_cv)
## 10-Fold CV Lasso RMSE: 1.139468
The coefficient paths plot for LASSO regression (shown above) displays the changes in the coefficients of the predictors as the regularization parameter λ varies. The vertical dashed lines indicate the optimal λ values chosen by cross-validation to minimize the Mean Squared Error (MSE). As λ increases, more coefficients are shrunk to zero, resulting in a simpler model.
The optimal λ is selected based on the 10-fold cross-validation procedure, which evaluates the model performance on different subsets of the data. The λ value corresponding to the lowest cross-validated error is chosen to ensure the best predictive performance while preventing overfitting.
With the diet variables added, the LASSO model shows an RMSE of approximately 1.14, the average error between the predicted and actual BMI Z-scores. The model again identified child's age as a significant predictor with a non-zero coefficient, while most diet-related variables were shrunk to zero, indicating their lesser impact when combined with the covariates.
fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)
coef(fit_ridge)
## 34 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 8.732635e-01
## hs_child_age_None -5.738585e-02
## e3_sex_Nonefemale -3.392957e-38
## e3_sex_Nonemale 3.392957e-38
## e3_yearbir_None2004 -1.596853e-37
## e3_yearbir_None2005 2.950807e-38
## e3_yearbir_None2006 -1.018118e-38
## e3_yearbir_None2007 -5.940442e-38
## e3_yearbir_None2008 3.414322e-38
## e3_yearbir_None2009 3.950007e-37
## h_bfdur_Ter(10.8,34.9] 1.959443e-37
## h_bfdur_Ter(34.9,Inf] -1.196150e-37
## hs_bakery_prod_Ter(2,6] 2.102760e-37
## hs_bakery_prod_Ter(6,Inf] -1.896677e-37
## hs_break_cer_Ter(1.1,5.5] 9.801144e-39
## hs_break_cer_Ter(5.5,Inf] -2.286856e-37
## hs_dairy_Ter(14.6,25.6] 5.634165e-38
## hs_dairy_Ter(25.6,Inf] 8.216707e-41
## hs_fastfood_Ter(0.132,0.5] 1.215262e-37
## hs_fastfood_Ter(0.5,Inf] -6.235230e-38
## hs_org_food_Ter(0.132,1] 8.146218e-38
## hs_org_food_Ter(1,Inf] -1.900266e-37
## hs_proc_meat_Ter(1.5,4] 1.193171e-37
## hs_proc_meat_Ter(4,Inf] -1.994408e-38
## hs_total_fish_Ter(1.5,3] -8.902720e-38
## hs_total_fish_Ter(3,Inf] 2.141752e-38
## hs_total_fruits_Ter(7,14.1] 2.168036e-37
## hs_total_fruits_Ter(14.1,Inf] -2.040600e-37
## hs_total_lipids_Ter(3,7] -9.170321e-39
## hs_total_lipids_Ter(7,Inf] -2.153479e-37
## hs_total_sweets_Ter(4.1,8.5] -7.677763e-38
## hs_total_sweets_Ter(8.5,Inf] -1.716694e-38
## hs_total_veg_Ter(6,8.5] 2.132182e-38
## hs_total_veg_Ter(8.5,Inf] -1.282604e-37
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)
cat("Diet + Covariates Ridge RMSE:", rmse_ridge, "\n")
## Diet + Covariates Ridge RMSE: 1.130793
ridge_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
ridge_predictions_cv <- predict(ridge_model, x_test)
mse_ridge_cv <- mean((y_test - ridge_predictions_cv)^2)
rmse_ridge_cv <- sqrt(mse_ridge_cv)
cat("10-Fold CV Ridge RMSE:", rmse_ridge_cv, "\n")
## 10-Fold CV Ridge RMSE: 1.129036
The coefficient paths plot for Ridge regression shows the changes in the coefficients as λ varies. Unlike LASSO, Ridge regression shrinks the coefficients but does not set them to zero. This results in all predictors contributing to the model, though some may have very small coefficients.
The Ridge model with the addition of diet variables shows an RMSE of around 1.13, suggesting slightly better performance compared to the LASSO model.
fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)
coef(fit_enet)
## 34 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.87326347
## hs_child_age_None -0.05738585
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## h_bfdur_Ter(10.8,34.9] .
## h_bfdur_Ter(34.9,Inf] .
## hs_bakery_prod_Ter(2,6] .
## hs_bakery_prod_Ter(6,Inf] .
## hs_break_cer_Ter(1.1,5.5] .
## hs_break_cer_Ter(5.5,Inf] .
## hs_dairy_Ter(14.6,25.6] .
## hs_dairy_Ter(25.6,Inf] .
## hs_fastfood_Ter(0.132,0.5] .
## hs_fastfood_Ter(0.5,Inf] .
## hs_org_food_Ter(0.132,1] .
## hs_org_food_Ter(1,Inf] .
## hs_proc_meat_Ter(1.5,4] .
## hs_proc_meat_Ter(4,Inf] .
## hs_total_fish_Ter(1.5,3] .
## hs_total_fish_Ter(3,Inf] .
## hs_total_fruits_Ter(7,14.1] .
## hs_total_fruits_Ter(14.1,Inf] .
## hs_total_lipids_Ter(3,7] .
## hs_total_lipids_Ter(7,Inf] .
## hs_total_sweets_Ter(4.1,8.5] .
## hs_total_sweets_Ter(8.5,Inf] .
## hs_total_veg_Ter(6,8.5] .
## hs_total_veg_Ter(8.5,Inf] .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)
cat("Diet + Covariates Elastic Net RMSE:", rmse_enet, "\n")
## Diet + Covariates Elastic Net RMSE: 1.136809
enet_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = seq(0, 1, length = 10), lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
enet_predictions_cv <- predict(enet_model, x_test)
mse_enet_cv <- mean((y_test - enet_predictions_cv)^2)
rmse_enet_cv <- sqrt(mse_enet_cv)
cat("10-Fold CV Elastic Net RMSE:", rmse_enet_cv, "\n")
## 10-Fold CV Elastic Net RMSE: 1.138767
The coefficient paths for Elastic Net combine the features of both LASSO and Ridge regressions, balancing between setting some coefficients to zero (like LASSO) and shrinking others (like Ridge). The paths display the progression of coefficients as λ varies.
The Elastic Net model for Diet + Covariates performs similarly, with an RMSE of about 1.14, close to the LASSO result.
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)
cat("Diet + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Diet + Covariates Decision Tree RMSE: 1.152094
printcp(fit_tree)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] hs_bakery_prod_Ter hs_break_cer_Ter hs_child_age_None
## [4] hs_total_lipids_Ter
##
## Root node error: 1328.8/913 = 1.4554
##
## n= 913
##
## CP nsplit rel error xerror xstd
## 1 0.011773 0 1.00000 1.0010 0.050839
## 2 0.010484 3 0.96468 1.0155 0.052218
## 3 0.010000 4 0.95420 1.0156 0.052011
plotcp(fit_tree)
optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]
pruned_tree <- prune(fit_tree, cp = optimal_cp)
rpart.plot(pruned_tree)
pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Pruned Diet + Covariates Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Diet + Covariates Decision Tree RMSE: 1.148665
set.seed(101)
train_control <- trainControl(method = "cv", number = 10)
cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.001))
pruned_tree_model <- train(
hs_zbmi_who ~ .,
data = train_data,
method = "rpart",
trControl = train_control,
tuneGrid = cp_grid
)
best_cp <- pruned_tree_model$bestTune$cp
cat("Best cp value from cross-validation:", best_cp, "\n")
## Best cp value from cross-validation: 0.1
fit_tree_unpruned <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
fit_tree_best <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova", control = rpart.control(cp = best_cp))
rpart.plot(fit_tree_unpruned)
rpart.plot(fit_tree_best)
unpruned_tree_predictions <- predict(fit_tree_unpruned, newdata = test_data)
pruned_tree_predictions <- predict(fit_tree_best, newdata = test_data)
mse_unpruned_tree <- mean((test_data$hs_zbmi_who - unpruned_tree_predictions)^2)
rmse_unpruned_tree <- sqrt(mse_unpruned_tree)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Unpruned Decision Tree RMSE:", rmse_unpruned_tree, "\n")
## Unpruned Decision Tree RMSE: 1.152094
cat("Pruned Decision Tree RMSE with Best cp:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE with Best cp: 1.148665
The decision tree split on child’s age first, indicating it is the most important variable. Subsequent splits involve dietary variables like bakery products, cereal, and total lipids.
The tree was pruned using the complexity parameter (cp) to avoid overfitting. The optimal cp was found using cross-validation.
Pruning slightly improved the model’s performance by reducing overfitting.
set.seed(101)
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data)
varImpPlot(fit_rf)
importance(fit_rf)
## IncNodePurity
## hs_child_age_None 233.57368
## e3_sex_None 34.19375
## e3_yearbir_None 97.75201
## h_bfdur_Ter 65.45676
## hs_bakery_prod_Ter 64.03655
## hs_break_cer_Ter 65.99478
## hs_dairy_Ter 69.72797
## hs_fastfood_Ter 55.45883
## hs_org_food_Ter 65.89629
## hs_proc_meat_Ter 68.85280
## hs_total_fish_Ter 63.72114
## hs_total_fruits_Ter 64.85756
## hs_total_lipids_Ter 65.79002
## hs_total_sweets_Ter 66.74543
## hs_total_veg_Ter 69.89130
rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((test_data$hs_zbmi_who - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)
cat("Diet + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Covariates Random Forest RMSE: 1.139899
rf_model <- train(
x_train, y_train,
method = "rf",
trControl = train_control,
tuneLength = 10
)
rf_predictions_cv <- predict(rf_model, x_test)
mse_rf_cv <- mean((y_test - rf_predictions_cv)^2)
rmse_rf_cv <- sqrt(mse_rf_cv)
cat("10-Fold CV Random Forest RMSE:", rmse_rf_cv, "\n")
## 10-Fold CV Random Forest RMSE: 1.129478
Random forest identified child’s age and year of birth as the most important variables. The model provided a robust prediction by aggregating multiple decision trees, reducing variance, and handling nonlinear relationships well.
set.seed(101)
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5, verbose = FALSE)
summary(fit_gbm)
best_trees <- gbm.perf(fit_gbm, method = "cv")
gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_trees)
mse_gbm <- mean((test_data$hs_zbmi_who - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Diet + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Covariates GBM RMSE: 1.137923
gbm_model <- train(
x_train, y_train,
method = "gbm",
trControl = train_control,
tuneLength = 10,
verbose = FALSE
)
gbm_predictions_cv <- predict(gbm_model, x_test)
mse_gbm_cv <- mean((y_test - gbm_predictions_cv)^2)
rmse_gbm_cv <- sqrt(mse_gbm_cv)
cat("10-Fold CV GBM RMSE:", rmse_gbm_cv, "\n")
## 10-Fold CV GBM RMSE: 1.136542
GBM also highlighted age as the most influential variable, followed by birth year and several diet variables. The number of trees was optimized using cross-validation, indicated by the vertical line in the error plot. GBM showed strong predictive performance by combining boosting with decision trees, focusing on reducing bias and variance.
The inclusion of diet variables generally improved the model performance, highlighting the importance of considering dietary factors alongside traditional covariates in predicting BMI Z-score. All models consistently identified child’s age as a significant predictor. Regularization techniques like ridge and elastic net performed well, handling multicollinearity and ensuring robust predictions. The decision tree and random forest models highlighted the importance of specific dietary variables in combination with covariates. GBM provided a comprehensive approach by effectively combining multiple weak learners, optimizing performance.
chemicals_selected <- c(
"hs_cd_c_Log2", "hs_co_c_Log2", "hs_cs_c_Log2", "hs_cu_c_Log2",
"hs_hg_c_Log2", "hs_mo_c_Log2", "hs_pb_c_Log2", "hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2", "hs_pcb170_cadj_Log2", "hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2", "hs_pfhxs_c_Log2", "hs_pfoa_c_Log2",
"hs_pfos_c_Log2", "hs_prpa_cadj_Log2", "hs_mbzp_cadj_Log2", "hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
chemical_data <- finalized_data %>% dplyr::select(c(covariates_selected, chemicals_selected, "hs_zbmi_who"))
set.seed(101)
trainIndex <- createDataPartition(chemical_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- chemical_data[trainIndex, ]
test_data <- chemical_data[-trainIndex, ]
train_control <- trainControl(method = "cv", number = 10)
x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)
coef(fit_lasso)
## 29 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.201642945
## hs_child_age_None -0.042729967
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 -0.040205226
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## hs_cd_c_Log2 -0.011319810
## hs_co_c_Log2 .
## hs_cs_c_Log2 .
## hs_cu_c_Log2 0.260654501
## hs_hg_c_Log2 .
## hs_mo_c_Log2 -0.069199903
## hs_pb_c_Log2 .
## hs_dde_cadj_Log2 -0.041339092
## hs_pcb153_cadj_Log2 -0.125818759
## hs_pcb170_cadj_Log2 -0.043143964
## hs_dep_cadj_Log2 -0.003677983
## hs_pbde153_cadj_Log2 -0.036720000
## hs_pfhxs_c_Log2 .
## hs_pfoa_c_Log2 -0.097979410
## hs_pfos_c_Log2 .
## hs_prpa_cadj_Log2 .
## hs_mbzp_cadj_Log2 0.007077421
## hs_mibp_cadj_Log2 -0.025837249
## hs_mnbp_cadj_Log2 -0.004654193
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)
cat("Chemical + Covariates Lasso RMSE:", rmse_lasso, "\n")
## Chemical + Covariates Lasso RMSE: 1.040557
lasso_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
lasso_predictions_cv <- predict(lasso_model, x_test)
mse_lasso_cv <- mean((y_test - lasso_predictions_cv)^2)
rmse_lasso_cv <- sqrt(mse_lasso_cv)
cat("10-Fold CV Lasso RMSE:", rmse_lasso_cv)
## 10-Fold CV Lasso RMSE: 1.040727
The optimal lambda (λ) is chosen by cross-validation to minimize the Mean Squared Error (MSE); here it falls near log(λ) ≈ -2.5. At this λ, child age retains a negative coefficient, and the chemical exposures with the largest nonzero coefficients are copper (hs_cu_c_Log2, 0.261), PCB-153 (hs_pcb153_cadj_Log2, -0.126), and PFOA (hs_pfoa_c_Log2, -0.098); cobalt (hs_co_c_Log2) and mercury (hs_hg_c_Log2) are shrunk to zero.
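The two standard λ choices reported by cv.glmnet can be inspected directly; a brief sketch, assuming the fit_lasso object fitted above:

```r
# lambda.min minimizes the cross-validated MSE; lambda.1se is the largest lambda
# whose error is within one standard error of that minimum (a sparser fit)
log(fit_lasso$lambda.min)
log(fit_lasso$lambda.1se)

# number of nonzero coefficients retained under each choice
sum(coef(fit_lasso, s = "lambda.min") != 0)
sum(coef(fit_lasso, s = "lambda.1se") != 0)
```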
fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)
coef(fit_ridge)
## 29 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.609471669
## hs_child_age_None -0.054184972
## e3_sex_Nonefemale -0.023174163
## e3_sex_Nonemale 0.023148148
## e3_yearbir_None2004 -0.128309522
## e3_yearbir_None2005 0.028409105
## e3_yearbir_None2006 0.053001523
## e3_yearbir_None2007 -0.027948121
## e3_yearbir_None2008 -0.015135402
## e3_yearbir_None2009 0.235538660
## hs_cd_c_Log2 -0.036089222
## hs_co_c_Log2 -0.030180059
## hs_cs_c_Log2 0.066973465
## hs_cu_c_Log2 0.319809539
## hs_hg_c_Log2 -0.014578657
## hs_mo_c_Log2 -0.071569755
## hs_pb_c_Log2 -0.039355400
## hs_dde_cadj_Log2 -0.048885680
## hs_pcb153_cadj_Log2 -0.115503946
## hs_pcb170_cadj_Log2 -0.036730018
## hs_dep_cadj_Log2 -0.011840831
## hs_pbde153_cadj_Log2 -0.027003858
## hs_pfhxs_c_Log2 -0.026465601
## hs_pfoa_c_Log2 -0.101071688
## hs_pfos_c_Log2 -0.025692420
## hs_prpa_cadj_Log2 0.001538225
## hs_mbzp_cadj_Log2 0.039172212
## hs_mibp_cadj_Log2 -0.035550506
## hs_mnbp_cadj_Log2 -0.039493899
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)
cat("Chemical + Covariates Ridge RMSE:", rmse_ridge, "\n")
## Chemical + Covariates Ridge RMSE: 1.035505
ridge_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
ridge_predictions_cv <- predict(ridge_model, x_test)
mse_ridge_cv <- mean((y_test - ridge_predictions_cv)^2)
rmse_ridge_cv <- sqrt(mse_ridge_cv)
cat("10-Fold CV Ridge RMSE:", rmse_ridge_cv, "\n")
## 10-Fold CV Ridge RMSE: 1.035411
The optimal lambda is found around log(λ) ≈ 0, where MSE is minimized. Multiple variables, including age, birth year, and various chemical variables, have significant coefficients.
fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)
coef(fit_enet)
## 29 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.30722494
## hs_child_age_None -0.04475073
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 -0.06251662
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## hs_cd_c_Log2 -0.01538343
## hs_co_c_Log2 .
## hs_cs_c_Log2 .
## hs_cu_c_Log2 0.27721103
## hs_hg_c_Log2 .
## hs_mo_c_Log2 -0.07149465
## hs_pb_c_Log2 .
## hs_dde_cadj_Log2 -0.04498749
## hs_pcb153_cadj_Log2 -0.12563741
## hs_pcb170_cadj_Log2 -0.04301914
## hs_dep_cadj_Log2 -0.00523836
## hs_pbde153_cadj_Log2 -0.03600515
## hs_pfhxs_c_Log2 .
## hs_pfoa_c_Log2 -0.10302534
## hs_pfos_c_Log2 .
## hs_prpa_cadj_Log2 .
## hs_mbzp_cadj_Log2 0.01485361
## hs_mibp_cadj_Log2 -0.02916026
## hs_mnbp_cadj_Log2 -0.01195464
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)
cat("Chemical + Covariates Elastic Net RMSE:", rmse_enet, "\n")
## Chemical + Covariates Elastic Net RMSE: 1.040324
enet_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = seq(0, 1, length = 10), lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
enet_predictions_cv <- predict(enet_model, x_test)
mse_enet_cv <- mean((y_test - enet_predictions_cv)^2)
rmse_enet_cv <- sqrt(mse_enet_cv)
cat("10-Fold CV Elastic Net RMSE:", rmse_enet_cv, "\n")
## 10-Fold CV Elastic Net RMSE: 1.035456
Elastic Net combines LASSO and Ridge penalties, leading to variable selection and coefficient shrinkage. The optimal lambda is chosen near log(λ) ≈ -2.5.
Variables:
hs_child_age_None: -0.0448
hs_cu_c_Log2: 0.2772 (positive association)
hs_pfoa_c_Log2: -0.1030
hs_mbzp_cadj_Log2: 0.0149 (slight positive association)
hs_pcb153_cadj_Log2: -0.1256
Elastic Net confirms the strong positive association of copper (hs_cu_c_Log2) and the negative associations of hs_pfoa_c_Log2 and hs_pcb153_cadj_Log2, while shrinking cobalt (hs_co_c_Log2) and mercury (hs_hg_c_Log2) to zero.
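Since the caret grid above searches over both α and λ, the selected mixing parameter can be read off the fitted object; a small sketch using the enet_model trained above:

```r
enet_model$bestTune   # cross-validated best (alpha, lambda) pair
plot(enet_model)      # RMSE profile across the alpha/lambda tuning grid
```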
set.seed(101)
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)
cat("Chemical + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Chemical + Covariates Decision Tree RMSE: 1.108581
printcp(fit_tree)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] hs_cu_c_Log2 hs_dde_cadj_Log2 hs_mbzp_cadj_Log2
## [4] hs_mnbp_cadj_Log2 hs_pbde153_cadj_Log2 hs_pcb170_cadj_Log2
## [7] hs_pfoa_c_Log2
##
## Root node error: 1328.8/913 = 1.4554
##
## n= 913
##
## CP nsplit rel error xerror xstd
## 1 0.077388 0 1.00000 1.00089 0.050861
## 2 0.032769 1 0.92261 0.93652 0.048377
## 3 0.032579 2 0.88984 0.91682 0.046750
## 4 0.025008 3 0.85726 0.91218 0.046570
## 5 0.014062 4 0.83226 0.89795 0.045241
## 6 0.013674 6 0.80413 0.89837 0.046138
## 7 0.013602 7 0.79046 0.89907 0.046094
## 8 0.013517 8 0.77686 0.89907 0.046094
## 9 0.011314 9 0.76334 0.90679 0.046317
## 10 0.011047 11 0.74071 0.91227 0.047148
## 11 0.010329 12 0.72967 0.92524 0.047400
## 12 0.010000 14 0.70901 0.92466 0.047181
plotcp(fit_tree)
optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]
pruned_tree <- prune(fit_tree, cp = optimal_cp)
rpart.plot(pruned_tree)
pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Pruned Chemical + Covariates Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Chemical + Covariates Decision Tree RMSE: 1.089746
set.seed(101)
train_control <- trainControl(method = "cv", number = 10)
cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.001))
pruned_tree_model <- train(
hs_zbmi_who ~ .,
data = train_data,
method = "rpart",
trControl = train_control,
tuneGrid = cp_grid
)
best_cp <- pruned_tree_model$bestTune$cp
cat("Best cp value from cross-validation:", best_cp, "\n")
## Best cp value from cross-validation: 0.023
fit_tree_unpruned <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
fit_tree_best <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova", control = rpart.control(cp = best_cp))
rpart.plot(fit_tree_unpruned)
rpart.plot(fit_tree_best)
unpruned_tree_predictions <- predict(fit_tree_unpruned, newdata = test_data)
pruned_tree_predictions <- predict(fit_tree_best, newdata = test_data)
mse_unpruned_tree <- mean((test_data$hs_zbmi_who - unpruned_tree_predictions)^2)
rmse_unpruned_tree <- sqrt(mse_unpruned_tree)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Unpruned Decision Tree RMSE:", rmse_unpruned_tree, "\n")
## Unpruned Decision Tree RMSE: 1.108581
cat("Pruned Decision Tree RMSE with Best cp:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE with Best cp: 1.089746
The decision tree diagram shows the splits based on different chemical covariates. The root node splits on Polychlorinated biphenyl-170 (hs_pcb170_cadj_Log2), indicating its high importance.
Variables such as hs_pcb170_cadj_Log2, hs_pbde153_cadj_Log2, and hs_dde_cadj_Log2 are prominent, as they appear at the top levels of the tree.
The unpruned decision tree has an RMSE of 1.109; after pruning at the complexity parameter (cp) that minimizes the cross-validated error, the RMSE drops slightly to 1.090, indicating modestly improved generalization.
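Besides minimizing xerror, a common alternative for choosing cp is the one-standard-error rule, which prefers the simplest tree whose cross-validated error is within one standard error of the minimum. A sketch against the fit_tree object above:

```r
cptab <- fit_tree$cptable
min_row <- which.min(cptab[, "xerror"])
# largest cp (simplest tree) whose xerror is within 1 SE of the minimum;
# cptable rows are ordered by decreasing CP, so the first qualifying row wins
threshold <- cptab[min_row, "xerror"] + cptab[min_row, "xstd"]
cp_1se <- cptab[which(cptab[, "xerror"] <= threshold)[1], "CP"]
pruned_1se <- prune(fit_tree, cp = cp_1se)
rpart.plot(pruned_1se)
```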
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500)
rf_predictions <- predict(fit_rf, newdata = test_data)
varImpPlot(fit_rf)
importance(fit_rf)
## IncNodePurity
## hs_child_age_None 48.822971
## e3_sex_None 6.356073
## e3_yearbir_None 34.327078
## hs_cd_c_Log2 49.069793
## hs_co_c_Log2 51.682060
## hs_cs_c_Log2 42.322759
## hs_cu_c_Log2 66.200851
## hs_hg_c_Log2 53.722138
## hs_mo_c_Log2 59.137163
## hs_pb_c_Log2 46.948933
## hs_dde_cadj_Log2 81.399724
## hs_pcb153_cadj_Log2 86.076191
## hs_pcb170_cadj_Log2 132.902458
## hs_dep_cadj_Log2 55.402055
## hs_pbde153_cadj_Log2 100.134593
## hs_pfhxs_c_Log2 44.500787
## hs_pfoa_c_Log2 60.306554
## hs_pfos_c_Log2 51.809387
## hs_prpa_cadj_Log2 50.765210
## hs_mbzp_cadj_Log2 57.448789
## hs_mibp_cadj_Log2 45.657870
## hs_mnbp_cadj_Log2 53.036576
mse_rf <- mean((y_test - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)
cat("Chemical + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Chemical + Covariates Random Forest RMSE: 1.031425
rf_model <- train(
x_train, y_train,
method = "rf",
trControl = train_control,
tuneLength = 10
)
rf_predictions_cv <- predict(rf_model, x_test)
mse_rf_cv <- mean((y_test - rf_predictions_cv)^2)
rmse_rf_cv <- sqrt(mse_rf_cv)
cat("10-Fold CV Random Forest RMSE:", rmse_rf_cv, "\n")
## 10-Fold CV Random Forest RMSE: 1.031669
The random forest model’s variable importance plot ranks variables by their increase in node purity. hs_pcb170_cadj_Log2, hs_pbde153_cadj_Log2, and hs_pcb153_cadj_Log2 are the top contributors, with hs_dde_cadj_Log2 close behind.
The random forest model achieves a test RMSE of 1.031, demonstrating robust predictive performance from the ensemble of decision trees.
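IncNodePurity can favor predictors with many possible split points; the permutation-based %IncMSE measure is often a more stable ranking. A sketch (refitting with importance = TRUE, so results may differ slightly from the forest above):

```r
fit_rf_perm <- randomForest(hs_zbmi_who ~ ., data = train_data,
                            ntree = 500, importance = TRUE)
importance(fit_rf_perm, type = 1)   # %IncMSE for each predictor
varImpPlot(fit_rf_perm)             # both %IncMSE and IncNodePurity panels
```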
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 100, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
summary(fit_gbm)
best_iter <- gbm.perf(fit_gbm, method = "cv")
gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_iter)
mse_gbm <- mean((y_test - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Chemical + Covariates GBM RMSE:", rmse_gbm, "\n")
## Chemical + Covariates GBM RMSE: 1.076264
gbm_model <- train(
x_train, y_train,
method = "gbm",
trControl = train_control,
tuneLength = 10,
verbose = FALSE
)
gbm_predictions_cv <- predict(gbm_model, x_test)
mse_gbm_cv <- mean((y_test - gbm_predictions_cv)^2)
rmse_gbm_cv <- sqrt(mse_gbm_cv)
cat("10-Fold CV GBM RMSE:", rmse_gbm_cv, "\n")
## 10-Fold CV GBM RMSE: 1.040439
The GBM model’s relative influence plot shows that hs_pcb170_cadj_Log2, hs_pbde153_cadj_Log2, and hs_dde_cadj_Log2 have the highest relative influence on the model’s predictions.
The GBM model has an RMSE of 1.076, and the cross-validated RMSE is 1.040, indicating good predictive accuracy and generalization to unseen data.
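One caveat: with shrinkage = 0.01, 100 trees is often too few for the cross-validated error curve to reach its minimum, so gbm.perf() may select the boundary iteration. A sketch of a longer run (n.trees chosen for illustration):

```r
fit_gbm_long <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian",
                    n.trees = 2000, interaction.depth = 3, shrinkage = 0.01,
                    cv.folds = 5)
# the optimal iteration should now fall inside the search range, not at its edge
best_iter_long <- gbm.perf(fit_gbm_long, method = "cv")
```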
LASSO, Ridge, and Elastic Net Models: These models effectively identify significant chemical covariates, with the best λ selected by cross-validation. Chemical exposure variables such as hs_cu_c_Log2, hs_pcb153_cadj_Log2, and hs_pfoa_c_Log2 carry notable coefficients, indicating their influence on the outcome. Ridge and elastic net achieve slightly lower test RMSEs (about 1.035) than LASSO (about 1.041), suggesting marginally better generalization.
Decision Tree: Provides an intuitive model with key splits on important chemical covariates but may overfit without pruning.
Random Forest: Offers robust performance by aggregating multiple trees, highlighting key variables and achieving low RMSE.
GBM: Combines boosting with tree-based methods to enhance predictive accuracy, with clear identification of influential variables.
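For a side-by-side view, the held-out RMSEs computed in this section can be collected into one table; a sketch using the objects defined above:

```r
rmse_summary <- data.frame(
  Model = c("LASSO (CV)", "Ridge (CV)", "Elastic Net (CV)",
            "Pruned Tree", "Random Forest (CV)", "GBM (CV)"),
  RMSE  = c(rmse_lasso_cv, rmse_ridge_cv, rmse_enet_cv,
            rmse_pruned_tree, rmse_rf_cv, rmse_gbm_cv)
)
rmse_summary[order(rmse_summary$RMSE), ]   # best model first
```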
covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
diet_selected <- c(
"h_bfdur_Ter", "hs_bakery_prod_Ter", "hs_break_cer_Ter", "hs_dairy_Ter",
"hs_fastfood_Ter", "hs_org_food_Ter", "hs_proc_meat_Ter", "hs_total_fish_Ter",
"hs_total_fruits_Ter", "hs_total_lipids_Ter", "hs_total_sweets_Ter", "hs_total_veg_Ter"
)
chemicals_selected <- c(
"hs_cd_c_Log2", "hs_co_c_Log2", "hs_cs_c_Log2", "hs_cu_c_Log2",
"hs_hg_c_Log2", "hs_mo_c_Log2", "hs_pb_c_Log2", "hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2", "hs_pcb170_cadj_Log2", "hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2", "hs_pfhxs_c_Log2", "hs_pfoa_c_Log2",
"hs_pfos_c_Log2", "hs_prpa_cadj_Log2", "hs_mbzp_cadj_Log2", "hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
combined_selected <- c(covariates_selected, diet_selected, chemicals_selected)
chemical_diet_cov_data <- finalized_data %>% dplyr::select(all_of(c(combined_selected, "hs_zbmi_who")))
set.seed(101)
trainIndex <- createDataPartition(chemical_diet_cov_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- chemical_diet_cov_data[trainIndex,]
test_data <- chemical_diet_cov_data[-trainIndex,]
train_control <- trainControl(method = "cv", number = 10)
x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who
penalty_factors <- rep(1, ncol(x_train))
# note: as before, exact name matching only exempts the numeric covariate;
# factor dummies expanded by model.matrix do not match their source names
# in covariates_selected and therefore remain penalized
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
num_covariates <- length(covariates_selected)
num_diet <- length(diet_selected)
num_chemicals <- length(chemicals_selected)
# make the group_indices vector
# caveat: these counts are based on variable names, but model.matrix expands each
# factor (covariates and diet tertiles) into multiple dummy columns, so they
# undercount the design-matrix columns; the padding below assigns the surplus
# columns to a catch-all group 4
group_indices <- c(
rep(1, num_covariates), # Group 1: Covariates
rep(2, num_diet), # Group 2: Diet
rep(3, num_chemicals) # Group 3: Chemicals
)
# pad to the number of design-matrix columns
if (length(group_indices) < ncol(x_train)) {
group_indices <- c(group_indices, rep(4, ncol(x_train) - length(group_indices)))
}
length(group_indices) == ncol(x_train) # should be TRUE
## [1] TRUE
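A more robust way to build the group vector (sketch): model.matrix records, in its "assign" attribute, which model term produced each column, so groups can be mapped exactly even after factors expand into several dummy columns:

```r
term_labels <- attr(terms(hs_zbmi_who ~ ., data = train_data), "term.labels")
col_term <- attr(x_train, "assign")              # source term index per column
group_of_term <- ifelse(term_labels %in% covariates_selected, 1,
                 ifelse(term_labels %in% diet_selected, 2, 3))
group_indices_exact <- group_of_term[col_term]   # one group id per column
length(group_indices_exact) == ncol(x_train)     # TRUE by construction
```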
set.seed(101)
group_lasso_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian")
group_lasso_predictions <- predict(group_lasso_model, x_test, type = "response")
mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)
cat("Group Lasso RMSE:", rmse_group_lasso, "\n")
## Group Lasso RMSE: 1.042529
set.seed(101)
cv_group_lasso <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)
cat("Optimal lambda:", cv_group_lasso$lambda.min, "\n")
## Optimal lambda: 0.02299869
coef_lasso <- coef(cv_group_lasso, s = "lambda.min")
sig_vars_lasso <- coef_lasso[coef_lasso != 0]
sig_vars_lasso <- sig_vars_lasso[names(sig_vars_lasso) != "(Intercept)"]
sig_vars_lasso_df <- data.frame(
Variable = names(sig_vars_lasso),
Coefficient = as.numeric(sig_vars_lasso)
)
sig_vars_lasso_df
group_lasso_predictions <- predict(cv_group_lasso, x_test, s = "lambda.min", type = "response")
mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)
cat("Group LASSO RMSE with 10-fold CV:", rmse_group_lasso, "\n")
## Group LASSO RMSE with 10-fold CV: 1.031241
The cross-validated RMSE is 1.031, showing consistency in performance across different data subsets.
Significant Variables:
e3_yearbir_None2009: The positive coefficient (0.349) suggests that being born in 2009 is associated with an increase in the BMI Z-score.
hs_cu_c_Log2: The positive coefficient (0.478) suggests that higher levels of copper are strongly associated with an increase in the BMI Z-score.
hs_cd_c_Log2: The negative coefficient (-0.047) indicates that higher levels of Cadmium are associated with a decrease in BMI Z-score.
hs_mibp_cadj_Log2: The negative coefficient (-0.077) indicates that higher levels of MiBP are associated with a decrease in BMI Z-score.
hs_total_fruits_Ter(7,14.1]: The positive coefficient (0.059) suggests that fruit intake in this tertile is associated with an increase in the BMI Z-score.
These variables have the most significant influence on the outcome based on the coefficients from the group LASSO model.
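The grouped penalty's behavior can also be inspected visually; a brief sketch using the fitted objects above:

```r
plot(group_lasso_model)   # coefficient paths: whole groups enter or leave together
plot(cv_group_lasso)      # cross-validated error as a function of lambda
```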
set.seed(101)
# note: grpreg's "grMCP" penalty is the group MCP (minimax concave penalty),
# not a true ridge penalty; "Group Ridge" is used here only as a shorthand label
group_ridge_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian")
group_ridge_predictions <- predict(group_ridge_model, x_test, type = "response")
mse_group_ridge <- mean((y_test - group_ridge_predictions)^2)
rmse_group_ridge <- sqrt(mse_group_ridge)
cat("Group Ridge RMSE:", rmse_group_ridge, "\n")
## Group Ridge RMSE: 1.042024
set.seed(101)
cv_group_ridge_model <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)
cat("Optimal lambda:", cv_group_ridge_model$lambda.min, "\n")
## Optimal lambda: 0.04019086
coef_ridge <- coef(cv_group_ridge_model, s = "lambda.min")
sig_vars_ridge <- coef_ridge[coef_ridge != 0]
sig_vars_ridge <- sig_vars_ridge[names(sig_vars_ridge) != "(Intercept)"]
sig_vars_ridge_df <- data.frame(
Variable = names(sig_vars_ridge),
Coefficient = as.numeric(sig_vars_ridge)
)
sig_vars_ridge_df
group_ridge_predictions <- predict(cv_group_ridge_model, x_test, s = "lambda.min", type = "response")
mse_group_ridge <- mean((y_test - group_ridge_predictions)^2)
rmse_group_ridge <- sqrt(mse_group_ridge)
cat("Group Ridge RMSE with 10-fold CV:", rmse_group_ridge, "\n")
## Group Ridge RMSE with 10-fold CV: 1.03338
e3_yearbir_None2009: An indicator for children born in 2009. Its positive coefficient (0.465) suggests that being born in 2009 is associated with an increase in the response variable relative to the reference year.
hs_cu_c_Log2: This variable represents the log-transformed concentration of Copper (Cu) in serum. A positive coefficient (0.606) indicates that higher concentrations of copper are associated with an increase in the response variable.
hs_mbzp_cadj_Log2: This variable represents the log-transformed concentration of Mono-Benzyl Phthalate (MBzP), a type of phthalate, in serum. A positive coefficient (0.110) suggests that higher levels of MBzP are associated with an increase in the response variable.
hs_break_cer_Ter(5.5,Inf]: This variable represents a high intake of breakfast cereals (tertile > 5.5 servings per week). A negative coefficient (-0.115) indicates that higher consumption of breakfast cereals is associated with a decrease in the response variable.
hs_pcb153_cadj_Log2: This variable represents the log-transformed serum concentration of Polychlorinated Biphenyl 153 (PCB-153). A negative coefficient (-0.199) suggests that higher levels of PCB-153 are associated with a decrease in the response variable.
set.seed(101)
# note: grpreg's "grSCAD" penalty is the group SCAD (smoothly clipped absolute
# deviation), not a true elastic net; "Group Elastic Net" is a shorthand label here
group_enet_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian")
group_enet_predictions <- predict(group_enet_model, x_test, type = "response")
mse_group_enet <- mean((y_test - group_enet_predictions)^2)
rmse_group_enet <- sqrt(mse_group_enet)
cat("Group Elastic Net RMSE:", rmse_group_enet, "\n")
## Group Elastic Net RMSE: 1.043394
set.seed(101)
cv_group_enet_model <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)
cat("Optimal lambda:", cv_group_enet_model$lambda.min, "\n")
## Optimal lambda: 0.03662041
coef_enet <- coef(cv_group_enet_model, s = "lambda.min")
sig_vars_enet <- coef_enet[coef_enet != 0]
sig_vars_enet <- sig_vars_enet[names(sig_vars_enet) != "(Intercept)"]
sig_vars_enet_df <- data.frame(
Variable = names(sig_vars_enet),
Coefficient = as.numeric(sig_vars_enet)
)
sig_vars_enet_df
group_enet_predictions <- predict(cv_group_enet_model, x_test, s = "lambda.min", type = "response")
mse_group_enet <- mean((y_test - group_enet_predictions)^2)
rmse_group_enet <- sqrt(mse_group_enet)
cat("Group Elastic Net RMSE with 10-fold CV:", rmse_group_enet, "\n")
## Group Elastic Net RMSE with 10-fold CV: 1.032168
e3_yearbir_None2009: The year of birth 2009 is positively associated (0.2673) with the outcome variable. This suggests that being born in 2009 is linked with an increase in the dependent variable.
hs_cu_c_Log2: Higher log-transformed copper levels are strongly positively associated (0.544) with the outcome, indicating that higher copper concentrations are linked with an increase in the dependent variable.
h_bfdur_Ter(10.8,34.9]: This breastfeeding duration category is positively associated (0.087) with the outcome, suggesting that breastfeeding for 10.8 to 34.9 months is linked with an increase in the dependent variable.
hs_pb_c_Log2: Higher log-transformed lead levels are negatively associated (-0.096) with the outcome, suggesting that higher lead concentrations are linked with a decrease in the dependent variable.
e3_yearbir_None2004: The year of birth 2004 is negatively associated (-0.125) with the outcome, indicating that being born in 2004 is linked with a decrease in the dependent variable.
set.seed(101)
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)
cat("Diet + Chemical + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Diet + Chemical + Covariates Decision Tree RMSE: 1.128749
printcp(fit_tree)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] hs_cu_c_Log2 hs_dde_cadj_Log2 hs_fastfood_Ter
## [4] hs_mnbp_cadj_Log2 hs_pbde153_cadj_Log2 hs_pcb170_cadj_Log2
## [7] hs_pfoa_c_Log2
##
## Root node error: 1328.8/913 = 1.4554
##
## n= 913
##
## CP nsplit rel error xerror xstd
## 1 0.077388 0 1.00000 1.00089 0.050861
## 2 0.032769 1 0.92261 0.93652 0.048377
## 3 0.032579 2 0.88984 0.91682 0.046750
## 4 0.025008 3 0.85726 0.91218 0.046570
## 5 0.015671 4 0.83226 0.90443 0.045712
## 6 0.013674 5 0.81658 0.91485 0.047377
## 7 0.013602 6 0.80291 0.91478 0.047609
## 8 0.013517 7 0.78931 0.91478 0.047609
## 9 0.011314 8 0.77579 0.93397 0.049086
## 10 0.011047 10 0.75317 0.96270 0.050245
## 11 0.010296 11 0.74212 0.98215 0.050293
## 12 0.010000 12 0.73182 0.98607 0.050369
plotcp(fit_tree)
optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]
pruned_tree <- prune(fit_tree, cp = optimal_cp)
rpart.plot(pruned_tree)
pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Pruned Diet + Chemical + Covariates Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Diet + Chemical + Covariates Decision Tree RMSE: 1.089746
set.seed(101)
train_control <- trainControl(method = "cv", number = 10)
cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.001))
pruned_tree_model <- train(
hs_zbmi_who ~ .,
data = train_data,
method = "rpart",
trControl = train_control,
tuneGrid = cp_grid
)
best_cp <- pruned_tree_model$bestTune$cp
cat("Best cp value from cross-validation:", best_cp, "\n")
## Best cp value from cross-validation: 0.023
fit_tree_unpruned <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
fit_tree_best <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova", control = rpart.control(cp = best_cp))
rpart.plot(fit_tree_unpruned)
rpart.plot(fit_tree_best)
unpruned_tree_predictions <- predict(fit_tree_unpruned, newdata = test_data)
pruned_tree_predictions <- predict(fit_tree_best, newdata = test_data)
mse_unpruned_tree <- mean((test_data$hs_zbmi_who - unpruned_tree_predictions)^2)
rmse_unpruned_tree <- sqrt(mse_unpruned_tree)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Unpruned Decision Tree RMSE:", rmse_unpruned_tree, "\n")
## Unpruned Decision Tree RMSE: 1.128749
cat("Pruned Decision Tree RMSE with Best cp:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE with Best cp: 1.089746
The RMSE for the unpruned decision tree is 1.129, and after pruning, the RMSE is 1.090. Pruning improves the model’s performance by reducing overfitting.
Important Variables:
hs_pcb170_cadj_Log2
hs_pbde153_cadj_Log2
hs_dde_cadj_Log2
hs_cu_c_Log2
hs_mnbp_cadj_Log2
hs_pfoa_c_Log2
hs_fastfood_Ter
These variables drive the splits in the tree; notably, a diet variable (hs_fastfood_Ter) now enters alongside the chemical exposures.
set.seed(101)
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500)
par(mar = c(6, 14, 4, 4) + 0.1)
varImpPlot(fit_rf, cex = 0.6)
importance(fit_rf)
## IncNodePurity
## hs_child_age_None 41.267546
## e3_sex_None 5.540866
## e3_yearbir_None 31.564485
## h_bfdur_Ter 13.136212
## hs_bakery_prod_Ter 20.689578
## hs_break_cer_Ter 13.679182
## hs_dairy_Ter 11.181729
## hs_fastfood_Ter 11.895794
## hs_org_food_Ter 10.856862
## hs_proc_meat_Ter 10.827847
## hs_total_fish_Ter 10.984827
## hs_total_fruits_Ter 13.977047
## hs_total_lipids_Ter 12.980031
## hs_total_sweets_Ter 10.749403
## hs_total_veg_Ter 12.258026
## hs_cd_c_Log2 44.886048
## hs_co_c_Log2 45.054544
## hs_cs_c_Log2 34.878871
## hs_cu_c_Log2 60.387604
## hs_hg_c_Log2 43.835480
## hs_mo_c_Log2 50.597285
## hs_pb_c_Log2 39.126962
## hs_dde_cadj_Log2 71.115165
## hs_pcb153_cadj_Log2 74.294671
## hs_pcb170_cadj_Log2 130.998985
## hs_dep_cadj_Log2 50.333164
## hs_pbde153_cadj_Log2 92.865989
## hs_pfhxs_c_Log2 38.593526
## hs_pfoa_c_Log2 53.392093
## hs_pfos_c_Log2 44.360304
## hs_prpa_cadj_Log2 43.848750
## hs_mbzp_cadj_Log2 49.661827
## hs_mibp_cadj_Log2 38.399318
## hs_mnbp_cadj_Log2 45.955350
rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((y_test - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)
cat("Diet + Chemical + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Chemical + Covariates Random Forest RMSE: 1.024848
rf_model <- train(
x_train, y_train,
method = "rf",
trControl = train_control,
tuneLength = 10
)
rf_predictions_cv <- predict(rf_model, x_test)
mse_rf_cv <- mean((y_test - rf_predictions_cv)^2)
rmse_rf_cv <- sqrt(mse_rf_cv)
cat("10-Fold CV Random Forest RMSE:", rmse_rf_cv, "\n")
## 10-Fold CV Random Forest RMSE: 1.025544
The RMSE for the Random Forest model is 1.025, and the cross-validated RMSE is 1.026, indicating strong performance.
Important Variables:
hs_pcb170_cadj_Log2
hs_pbde153_cadj_Log2
hs_pcb153_cadj_Log2
hs_dde_cadj_Log2
hs_cu_c_Log2
set.seed(101)
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 100, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
summary(fit_gbm)
best_iter <- gbm.perf(fit_gbm, method = "cv")
gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_iter)
mse_gbm <- mean((y_test - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Diet + Chemical + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Chemical + Covariates GBM RMSE: 1.072432
gbm_model <- train(
x_train, y_train,
method = "gbm",
trControl = train_control,
tuneLength = 10,
verbose = FALSE
)
gbm_predictions_cv <- predict(gbm_model, x_test)
mse_gbm_cv <- mean((y_test - gbm_predictions_cv)^2)
rmse_gbm_cv <- sqrt(mse_gbm_cv)
cat("10-Fold CV GBM RMSE:", rmse_gbm_cv, "\n")
## 10-Fold CV GBM RMSE: 1.048183
The RMSE for the GBM model is 1.072, and the cross-validated RMSE is 1.048, showing that cross-validated tuning improves generalization over the default fit.
Important Variables (based on Relative Influence):
hs_pcb170_cadj_Log2
hs_pbde153_cadj_Log2
hs_dde_cadj_Log2
hs_pfoa_c_Log2
hs_cu_c_Log2
These variables have the highest relative influence in the model, contributing significantly to reducing prediction errors.
For the metabolomic data, values with extensive missingness were excluded, since too many entries could not be credibly imputed by the mean or median.
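The exclusion rule described above can be expressed as a simple missingness filter; a sketch with an illustrative 20% threshold and a hypothetical metabol_data frame of metabolite columns:

```r
# share of missing entries per metabolite column (metabol_data is hypothetical)
missing_share <- colMeans(is.na(metabol_data))
# keep only metabolites with at most 20% missingness (threshold is illustrative)
metabol_kept <- metabol_data[, missing_share <= 0.20, drop = FALSE]
```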
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/metabol_serum.RData")
metabol_serum_transposed <- as.data.frame(t(metabol_serum.d))
metabol_serum_transposed$ID <- as.integer(rownames(metabol_serum_transposed))
# add the ID column to the first position
metabol_serum_transposed <- metabol_serum_transposed[, c("ID", setdiff(names(metabol_serum_transposed), "ID"))]
# ID is the first column, and the layout is preserved
kable(head(metabol_serum_transposed), align = "c", digits = 2, format = "pipe")
| ID | metab_1 | metab_2 | metab_3 | … | metab_176 | metab_177 |
|---|---|---|---|---|---|---|
| 430 | -2.15 | -0.71 | 8.60 | … | 6.86 | 8.36 |
| 1187 | -0.69 | -0.37 | 9.15 | … | 6.55 | 7.91 |

(First two rows of the transposed metabolite matrix; columns metab_4 through metab_175 are omitted here for readability.)
| 940 | 940 | -0.69 | -0.36 | 8.95 | -0.13 | 7.10 | 5.86 | 4.35 | 5.92 | -1.97 | -3.40 | -3.41 | -2.99 | -3.01 | -1.65 | -3.55 | -4.82 | -5.41 | -5.84 | -5.13 | -2.83 | -4.86 | -5.51 | -5.51 | -4.63 | -3.73 | -4.00 | -2.92 | -4.21 | 2.79 | -1.41 | -4.80 | -5.47 | -2.10 | -5.47 | -2.14 | -4.18 | -4.84 | -5.24 | -4.64 | -3.20 | -3.90 | -5.24 | -3.77 | -2.70 | -2.76 | 5.21 | 5.86 | 9.78 | 6.38 | 8.29 | 12.49 | 7.01 | 6.49 | 1.97 | 7.17 | 7.62 | 2.40 | 6.93 | 1.85 | 1.45 | 5.11 | 5.30 | 6.27 | 2.35 | 3.31 | -2.50 | -1.41 | -2.61 | -0.93 | -1.03 | 4.54 | 1.59 | 6.03 | -2.74 | 1.79 | 2.68 | -8.16 | 5.19 | 5.14 | 3.16 | 0.24 | 9.09 | 10.25 | 5.44 | 1.90 | 2.46 | 6.66 | 9.19 | 8.24 | 8.46 | 5.73 | 1.10 | 2.58 | 1.15 | 6.37 | 7.28 | 6.51 | 7.20 | -0.48 | -0.69 | -0.02 | 2.56 | 3.76 | 5.33 | -0.16 | -1.18 | -1.18 | -2.16 | -1.06 | -0.19 | -0.48 | -2.35 | -3.16 | 2.79 | 0.72 | 2.14 | 4.80 | 4.84 | 4.55 | 1.27 | 4.26 | 5.23 | 4.40 | 5.43 | 4.56 | 2.32 | 0.03 | 2.15 | 3.22 | 5.06 | 5.28 | 3.80 | 1.38 | 1.58 | 0.98 | 2.27 | 2.94 | 3.39 | 0.33 | -0.53 | 0.17 | 0.53 | 0.57 | 1.69 | -2.21 | -0.76 | 1.25 | 0.49 | 6.49 | 8.84 | -4.02 | -1.33 | 7.42 | 0.71 | 2.81 | 2.03 | 3.30 | 3.00 | -0.24 | 7.02 | 3.82 | 4.66 | 3.36 | -1.18 | 3.82 | 4.91 | -2.95 | -2.89 | -2.43 | -2.05 | 4.25 | 7.02 | 7.36 | 0.14 | 6.57 | 6.68 | 8.12 |
| 936 | 936 | -0.19 | -0.34 | 8.54 | -0.62 | 7.01 | 5.95 | 4.24 | 5.41 | -1.89 | -2.84 | -3.38 | -3.11 | -2.94 | -1.45 | -3.83 | -4.43 | -5.61 | -5.41 | -5.54 | -2.94 | -4.78 | -6.06 | -5.88 | -4.70 | -4.82 | -4.46 | -2.66 | -3.82 | 2.85 | -2.70 | -5.16 | -5.47 | -3.31 | -5.61 | -2.80 | -4.11 | -4.97 | -4.86 | -5.01 | -3.63 | -3.78 | -5.29 | -4.17 | -2.49 | -3.65 | 5.31 | 5.60 | 9.87 | 6.67 | 8.05 | 12.33 | 6.72 | 6.42 | 1.25 | 7.28 | 7.37 | 1.99 | 6.28 | 1.17 | 0.50 | 4.52 | 4.43 | 5.54 | 1.30 | 3.08 | -2.92 | -2.16 | -3.18 | -1.66 | -1.63 | 4.55 | 0.53 | 5.73 | -3.27 | 1.30 | 1.70 | -2.57 | 4.53 | 4.14 | 2.61 | -0.18 | 8.32 | 9.62 | 4.82 | 1.58 | 1.99 | 5.82 | 8.59 | 7.58 | 8.39 | 4.68 | 0.36 | 2.01 | -0.31 | 5.71 | 7.35 | 6.22 | 6.66 | -0.70 | -1.42 | -0.62 | 2.13 | 3.54 | 4.85 | -0.72 | -1.53 | -2.04 | -2.37 | -1.38 | -0.96 | -1.57 | -2.91 | -3.60 | 2.37 | 0.21 | 0.92 | 4.05 | 4.27 | 4.33 | 0.24 | 3.38 | 4.45 | 3.71 | 4.74 | 4.44 | 1.51 | -1.73 | 1.51 | 2.27 | 4.37 | 4.89 | 3.40 | 0.66 | 0.83 | 0.27 | 1.50 | 2.30 | 2.60 | 0.14 | -0.90 | -0.99 | -0.53 | -0.30 | 1.14 | -3.06 | -1.69 | 0.39 | 0.19 | 6.21 | 8.05 | -2.75 | -0.87 | 7.79 | 0.87 | 2.48 | 1.62 | 3.28 | 2.93 | -0.41 | 6.91 | 3.75 | 4.38 | 3.20 | -1.07 | 3.81 | 4.89 | -3.36 | -2.40 | -2.06 | -2.03 | 3.99 | 7.36 | 6.94 | 0.14 | 6.26 | 6.47 | 7.98 |
| 788 | 788 | -1.96 | -0.35 | 8.73 | -0.80 | 6.90 | 5.95 | 4.88 | 5.39 | -1.55 | -2.45 | -3.51 | -2.84 | -2.83 | -1.71 | -3.91 | -4.05 | -5.61 | -4.63 | -5.29 | -3.51 | -4.86 | -5.97 | -5.27 | -4.90 | -4.40 | -4.63 | -3.11 | -3.99 | 2.87 | -2.23 | -4.61 | -5.04 | -3.53 | -5.08 | -3.02 | -4.41 | -4.72 | -5.18 | -4.72 | -3.63 | -3.61 | -5.29 | -4.05 | -2.31 | -3.73 | 4.69 | 5.31 | 9.69 | 6.76 | 8.21 | 12.18 | 6.75 | 6.51 | 1.15 | 7.38 | 7.93 | 1.76 | 5.68 | -0.02 | -0.65 | 4.14 | 3.36 | 4.43 | 0.21 | 1.98 | -2.31 | -1.54 | -2.30 | -1.66 | -1.47 | 4.48 | 0.88 | 6.47 | -2.50 | 0.74 | 1.12 | -2.17 | 4.31 | 3.50 | 2.09 | -0.60 | 8.06 | 9.69 | 3.99 | 0.54 | 1.60 | 5.60 | 8.71 | 7.32 | 8.03 | 3.27 | -0.98 | 1.59 | -0.20 | 5.68 | 7.16 | 5.57 | 6.16 | -0.79 | -1.31 | -0.87 | 2.17 | 3.23 | 4.57 | -0.93 | -1.80 | -2.27 | -2.51 | -1.74 | -1.02 | -1.92 | -2.02 | -3.79 | 1.95 | -0.24 | 0.40 | 3.73 | 4.13 | 3.71 | 0.03 | 2.89 | 4.06 | 3.54 | 4.76 | 3.88 | 0.53 | -2.11 | 1.27 | 1.99 | 4.13 | 4.58 | 2.88 | 0.22 | 0.39 | 0.22 | 1.44 | 2.02 | 2.22 | 0.00 | -0.81 | -1.10 | -0.41 | -0.09 | 1.00 | -2.66 | -1.55 | 0.33 | 0.19 | 6.47 | 7.89 | -4.40 | -1.94 | 7.65 | 0.38 | 1.66 | 0.84 | 2.78 | 2.26 | -0.84 | 6.52 | 3.53 | 3.81 | 2.83 | -1.69 | 3.65 | 4.47 | -3.81 | -2.97 | -2.88 | -2.29 | 3.88 | 6.99 | 7.38 | -0.10 | 6.00 | 6.52 | 8.04 |
| 698 | 698 | -1.90 | -0.63 | 8.24 | -0.46 | 6.94 | 5.42 | 4.70 | 4.62 | -1.78 | -3.14 | -3.46 | -2.90 | -2.94 | -1.65 | -4.20 | -4.56 | -5.68 | -5.61 | -5.41 | -2.92 | -5.04 | -5.97 | -6.06 | -4.90 | -4.22 | -4.20 | -3.05 | -4.61 | 2.15 | -2.87 | -4.68 | -5.08 | -3.69 | -5.24 | -3.63 | -4.24 | -5.16 | -5.35 | -4.97 | -3.61 | -3.99 | -5.35 | -3.98 | -2.59 | -3.95 | 5.15 | 5.82 | 10.00 | 5.54 | 8.15 | 12.28 | 6.80 | 6.23 | 1.88 | 7.07 | 7.38 | 2.06 | 6.79 | 1.67 | 1.00 | 4.79 | 4.79 | 5.71 | 1.99 | 3.29 | -2.13 | -1.01 | -1.85 | -1.23 | -0.90 | 4.41 | -0.02 | 6.09 | -2.10 | 1.66 | 2.27 | -3.48 | 4.96 | 4.76 | 2.64 | 0.05 | 8.91 | 9.99 | 5.16 | 1.53 | 2.11 | 6.28 | 8.77 | 8.03 | 8.66 | 5.99 | 0.87 | 2.30 | 0.63 | 6.23 | 7.50 | 6.75 | 7.22 | -0.45 | -0.81 | -0.11 | 2.57 | 3.93 | 5.16 | -0.31 | -1.19 | -1.25 | -1.93 | -0.89 | 0.07 | -0.87 | -1.12 | -3.03 | 2.61 | 0.54 | 1.83 | 4.50 | 4.53 | 4.42 | 1.15 | 4.02 | 4.91 | 4.06 | 5.06 | 4.42 | 2.02 | -1.03 | 1.87 | 2.96 | 4.84 | 5.08 | 3.62 | 1.13 | 1.23 | 0.75 | 2.26 | 2.80 | 3.04 | 0.41 | -0.39 | 0.02 | 0.31 | 0.52 | 1.73 | -2.28 | -0.73 | 1.06 | 0.72 | 6.44 | 7.27 | -3.08 | -1.23 | 7.35 | 0.92 | 2.60 | 2.00 | 3.69 | 3.20 | -0.25 | 7.38 | 4.15 | 5.00 | 3.88 | -1.39 | 4.31 | 5.20 | -3.47 | -2.75 | -1.97 | -1.96 | 4.18 | 6.81 | 6.75 | 0.02 | 6.49 | 5.97 | 7.78 |
# load the HELIX codebook and data
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/HELIX.RData")
# postnatal chemical and diet (lifestyle) variables, excluding allergens
filtered_chem_diet <- codebook %>%
filter(domain %in% c("Chemicals", "Lifestyles") & period == "Postnatal" & subfamily != "Allergens")
# specific covariates
filtered_covariates <- codebook %>%
filter(domain == "Covariates" &
variable_name %in% c("ID", "e3_sex_None", "e3_yearbir_None", "hs_child_age_None"))
# specific phenotype variables
filtered_phenotype <- codebook %>%
filter(domain == "Phenotype" &
variable_name %in% c("hs_zbmi_who"))
# combining all necessary variables together
combined_codebook <- bind_rows(filtered_chem_diet, filtered_covariates, filtered_phenotype)
outcome_and_cov <- cbind(covariates, outcome_BMI)
outcome_and_cov <- outcome_and_cov[, !duplicated(colnames(outcome_and_cov))]
outcome_and_cov <- outcome_and_cov %>%
dplyr::select(ID, hs_child_age_None, e3_sex_None, e3_yearbir_None, hs_zbmi_who)
#the full chemicals list
chemicals_specific <- c(
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
#postnatal diet for child
postnatal_diet <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_readymade_Ter",
"hs_total_bread_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_potatoes_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
all_columns <- c(chemicals_specific, postnatal_diet)
extracted_exposome <- exposome %>% dplyr::select(all_of(all_columns))
selected_id_data <- cbind(outcome_and_cov, extracted_exposome)
# ID is the common identifier in both datasets
combined_data <- merge(selected_id_data, metabol_serum_transposed, by = "ID", all = TRUE)
selected_metabolomics_data <- combined_data %>% dplyr::select(-c(ID))
head(selected_metabolomics_data)
selected_metabolomics_data <- selected_metabolomics_data %>% na.omit()
set.seed(101)
trainIndex <- createDataPartition(selected_metabolomics_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
metabol_train_data <- selected_metabolomics_data[trainIndex,]
metabol_test_data <- selected_metabolomics_data[-trainIndex,]
x_train <- model.matrix(hs_zbmi_who ~ . -1, metabol_train_data)
y_train <- metabol_train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, metabol_test_data)
y_test <- metabol_test_data$hs_zbmi_who
train_control <- trainControl(method = "cv", number = 10)
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
num_covariates <- length(covariates_selected)
num_chemicals <- length(chemicals_specific)
num_diet <- length(postnatal_diet)
num_metabolomics <- ncol(metabol_serum_transposed) - 1 # subtract the ID column
# group order must mirror the column order of x_train (covariates, then chemicals,
# then diet, then metabolomics, as assembled above); note that the tertile diet
# factors expand into multiple dummy columns in model.matrix, which the padding
# below absorbs only crudely
group_indices <- c(
rep(1, num_covariates), # Group 1: Covariates
rep(2, num_chemicals), # Group 2: Chemicals
rep(3, num_diet), # Group 3: Diet
rep(4, num_metabolomics) # Group 4: Metabolomics
)
# pad with a fifth group if factor expansion created extra columns
if (length(group_indices) < ncol(x_train)) {
group_indices <- c(group_indices, rep(5, ncol(x_train) - length(group_indices)))
}
length(group_indices) == ncol(x_train) # this should be TRUE
## [1] TRUE
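If the glmnet-style `penalty.factor` argument used in the model fits below is not honored by the installed grpreg version, the package's own convention for unpenalized terms is a group index of 0. A minimal sketch, reusing the objects defined above:

```r
# grpreg leaves variables with group index 0 unpenalized, so the covariate
# block can be excluded from the group penalty directly
group_indices_unpen <- group_indices
group_indices_unpen[colnames(x_train) %in% covariates_selected] <- 0
# then fit as before, without penalty.factor, e.g.:
# grpreg(x_train, y_train, group = group_indices_unpen, penalty = "grLasso")
```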
# fit group lasso model
set.seed(101)
group_lasso_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian")
# note: without a `lambda` argument, predict() on a grpreg fit returns predictions
# for the whole lambda path, so the RMSE below averages across all path solutions;
# the cross-validated fit further down reports RMSE at a single selected lambda
group_lasso_predictions <- predict(group_lasso_model, x_test, type = "response")
mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)
cat("Group Lasso Test RMSE:", rmse_group_lasso, "\n")
## Group Lasso Test RMSE: 0.9334466
set.seed(101)
cv_group_lasso <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)
cat("Optimal lambda:", cv_group_lasso$lambda.min, "\n")
## Optimal lambda: 0.01400055
coef_lasso <- coef(cv_group_lasso, lambda = cv_group_lasso$lambda.min) # grpreg takes lambda =, not glmnet's s =
sig_vars_lasso <- coef_lasso[coef_lasso != 0]
sig_vars_lasso <- sig_vars_lasso[names(sig_vars_lasso) != "(Intercept)"]
sig_vars_lasso_df <- data.frame(
Variable = names(sig_vars_lasso),
Coefficient = as.numeric(sig_vars_lasso)
)
sig_vars_lasso_df
group_lasso_predictions <- predict(cv_group_lasso, x_test, lambda = cv_group_lasso$lambda.min, type = "response")
mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)
cat("Group LASSO RMSE with 10-fold CV:", rmse_group_lasso, "\n")
## Group LASSO RMSE with 10-fold CV: 0.8748226
Group Lasso achieves the lowest test and CV RMSE among the penalized regression models. Because the group penalty can zero out entire blocks of uninformative predictors, the selected model is comparatively parsimonious while retaining the important predictors, balancing variable selection and prediction accuracy effectively.
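To see which whole groups the penalty retains at the selected lambda, grpreg's predict method can report selected groups and variables directly. A short sketch using the cross-validated fit above:

```r
# groups and variables retained at lambda.min (the default for a cv.grpreg object)
predict(cv_group_lasso, type = "ngroups") # number of groups with a nonzero norm
predict(cv_group_lasso, type = "groups")  # which groups those are
predict(cv_group_lasso, type = "vars")    # indices of the nonzero coefficients
```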
set.seed(101)
# labeled "Group Ridge" in this report; grpreg does not provide a grouped ridge
# penalty, so the group MCP penalty ("grMCP") is fit here as the less-sparse alternative
group_ridge_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian")
group_ridge_predictions <- predict(group_ridge_model, x_test, type = "response")
mse_group_ridge <- mean((y_test - group_ridge_predictions)^2)
rmse_group_ridge <- sqrt(mse_group_ridge)
cat("Group Ridge RMSE:", rmse_group_ridge, "\n")
## Group Ridge RMSE: 0.9424999
set.seed(101)
cv_group_ridge_model <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)
cat("Optimal lambda:", cv_group_ridge_model$lambda.min, "\n")
## Optimal lambda: 0.02446636
coef_ridge <- coef(cv_group_ridge_model, lambda = cv_group_ridge_model$lambda.min)
sig_vars_ridge <- coef_ridge[coef_ridge != 0]
sig_vars_ridge <- sig_vars_ridge[names(sig_vars_ridge) != "(Intercept)"]
sig_vars_ridge_df <- data.frame(
Variable = names(sig_vars_ridge),
Coefficient = as.numeric(sig_vars_ridge)
)
sig_vars_ridge_df
group_ridge_predictions <- predict(cv_group_ridge_model, x_test, lambda = cv_group_ridge_model$lambda.min, type = "response")
mse_group_ridge <- mean((y_test - group_ridge_predictions)^2)
rmse_group_ridge <- sqrt(mse_group_ridge)
cat("Group Ridge RMSE with 10-fold CV:", rmse_group_ridge, "\n")
## Group Ridge RMSE with 10-fold CV: 0.8849928
Group Ridge yields slightly higher RMSE values than Group Lasso. Its larger optimal lambda shrinks coefficients more aggressively; a ridge-style penalty shrinks coefficients without discarding variables, which can cost interpretability but tends to produce robust predictions.
set.seed(101)
# labeled "Group Elastic Net" in this report; the penalty actually fit is group
# SCAD ("grSCAD"), another sparse group penalty available in grpreg
group_enet_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian")
group_enet_predictions <- predict(group_enet_model, x_test, type = "response")
mse_group_enet <- mean((y_test - group_enet_predictions)^2)
rmse_group_enet <- sqrt(mse_group_enet)
cat("Group Elastic Net RMSE:", rmse_group_enet, "\n")
## Group Elastic Net RMSE: 0.9440745
set.seed(101)
cv_group_enet_model <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)
cat("Optimal lambda:", cv_group_enet_model$lambda.min, "\n")
## Optimal lambda: 0.0185079
coef_enet <- coef(cv_group_enet_model, lambda = cv_group_enet_model$lambda.min)
sig_vars_enet <- coef_enet[coef_enet != 0]
sig_vars_enet <- sig_vars_enet[names(sig_vars_enet) != "(Intercept)"]
sig_vars_enet_df <- data.frame(
Variable = names(sig_vars_enet),
Coefficient = as.numeric(sig_vars_enet)
)
sig_vars_enet_df
group_enet_predictions <- predict(cv_group_enet_model, x_test, lambda = cv_group_enet_model$lambda.min, type = "response")
mse_group_enet <- mean((y_test - group_enet_predictions)^2)
rmse_group_enet <- sqrt(mse_group_enet)
cat("Group Elastic Net RMSE with 10-fold CV:", rmse_group_enet, "\n")
## Group Elastic Net RMSE with 10-fold CV: 0.8844928
The model labeled Group Elastic Net behaves much like Group Ridge here: it trades off variable selection against coefficient shrinkage, but its RMSE values remain slightly above Group Lasso's.
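For comparison with a conventional (non-group) elastic net, glmnet exposes the lasso/ridge mixing weight directly. This sketch assumes the `x_train`/`y_train` matrices from above; `alpha = 0.5` is an arbitrary illustrative choice:

```r
library(glmnet)
set.seed(101)
# alpha = 1 is the lasso, alpha = 0 is ridge; 0.5 mixes the two penalties
cv_enet_glmnet <- cv.glmnet(x_train, y_train, alpha = 0.5, nfolds = 10)
enet_pred <- predict(cv_enet_glmnet, newx = x_test, s = "lambda.min")
sqrt(mean((y_test - enet_pred)^2)) # test RMSE at the CV-selected lambda
```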
selected_metabolomics_data <- selected_metabolomics_data %>% na.omit()
set.seed(101)
trainIndex <- createDataPartition(selected_metabolomics_data$hs_zbmi_who, p = .7,
list = FALSE,
times = 1)
train_data <- selected_metabolomics_data[ trainIndex,]
test_data <- selected_metabolomics_data[-trainIndex,]
x_train <- model.matrix(hs_zbmi_who ~ ., train_data)[,-1]
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ ., test_data)[,-1]
y_test <- test_data$hs_zbmi_who
set.seed(101)
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)
cat("Full Decision Tree RMSE:", rmse_tree, "\n")
## Full Decision Tree RMSE: 1.243108
printcp(fit_tree)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] hs_cu_c_Log2 hs_dde_cadj_Log2 hs_pcb170_cadj_Log2
## [4] metab_100 metab_129 metab_140
## [7] metab_141 metab_142 metab_157
## [10] metab_161 metab_176 metab_47
## [13] metab_58 metab_8 metab_94
## [16] metab_95
##
## Root node error: 1254/841 = 1.4911
##
## n= 841
##
## CP nsplit rel error xerror xstd
## 1 0.089116 0 1.00000 1.00275 0.052776
## 2 0.069078 1 0.91088 0.94590 0.050339
## 3 0.037249 2 0.84181 0.91981 0.049729
## 4 0.034324 3 0.80456 0.90111 0.048711
## 5 0.025576 4 0.77023 0.89055 0.049604
## 6 0.021936 5 0.74466 0.90166 0.048993
## 7 0.021276 6 0.72272 0.90895 0.049515
## 8 0.020244 8 0.68017 0.90807 0.050310
## 9 0.016938 9 0.65992 0.91277 0.050356
## 10 0.015553 10 0.64299 0.90306 0.050272
## 11 0.015178 11 0.62743 0.90122 0.050368
## 12 0.014756 12 0.61226 0.90066 0.050300
## 13 0.014022 13 0.59750 0.91203 0.051412
## 14 0.013859 14 0.58348 0.92227 0.051732
## 15 0.012010 15 0.56962 0.93488 0.053622
## 16 0.011088 16 0.55761 0.93327 0.053609
## 17 0.011027 17 0.54652 0.92988 0.053560
## 18 0.010000 18 0.53549 0.93467 0.053891
plotcp(fit_tree)
optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]
pruned_tree <- prune(fit_tree, cp = optimal_cp)
rpart.plot(pruned_tree)
pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Pruned Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE: 1.12821
set.seed(101)
train_control <- trainControl(method = "cv", number = 10)
cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.001))
pruned_tree_model <- train(
hs_zbmi_who ~ .,
data = train_data,
method = "rpart",
trControl = train_control,
tuneGrid = cp_grid
)
best_cp <- pruned_tree_model$bestTune$cp
cat("Best cp value from cross-validation:", best_cp, "\n")
## Best cp value from cross-validation: 0.036
fit_tree_unpruned <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
fit_tree_best <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova", control = rpart.control(cp = best_cp))
rpart.plot(fit_tree_unpruned)
rpart.plot(fit_tree_best)
unpruned_tree_predictions <- predict(fit_tree_unpruned, newdata = test_data)
pruned_tree_predictions <- predict(fit_tree_best, newdata = test_data)
mse_unpruned_tree <- mean((test_data$hs_zbmi_who - unpruned_tree_predictions)^2)
rmse_unpruned_tree <- sqrt(mse_unpruned_tree)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Unpruned Decision Tree RMSE:", rmse_unpruned_tree, "\n")
## Unpruned Decision Tree RMSE: 1.243108
cat("Pruned Decision Tree RMSE with Best cp:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE with Best cp: 1.127666
Pruning improves the decision tree's performance, but its RMSE remains higher than that of the penalized regression models: pruning curbs overfitting without reaching the same predictive accuracy.
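A common alternative to pruning at the minimum cross-validated error is the one-standard-error rule, which picks the simplest tree whose error is within one SE of the minimum. A sketch using `fit_tree` from above:

```r
cp_tab <- fit_tree$cptable
best <- which.min(cp_tab[, "xerror"])
threshold <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]
# first (simplest) tree whose cross-validated error falls under the threshold
cp_1se <- cp_tab[which(cp_tab[, "xerror"] <= threshold)[1], "CP"]
pruned_1se <- prune(fit_tree, cp = cp_1se)
```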
set.seed(101)
rf_model <- randomForest(hs_zbmi_who ~ . , data = train_data, ntree = 500)
importance(rf_model)
## IncNodePurity
## hs_child_age_None 4.7927709
## e3_sex_None 0.5707810
## e3_yearbir_None 5.3687139
## hs_cd_c_Log2 4.8057550
## hs_co_c_Log2 5.5171622
## hs_cs_c_Log2 5.0356667
## hs_cu_c_Log2 12.4603237
## hs_hg_c_Log2 6.5609155
## hs_mo_c_Log2 9.6711331
## hs_pb_c_Log2 6.0711093
## hs_dde_cadj_Log2 13.2357957
## hs_pcb153_cadj_Log2 42.5218764
## hs_pcb170_cadj_Log2 78.6536355
## hs_dep_cadj_Log2 5.7451267
## hs_pbde153_cadj_Log2 28.0252092
## hs_pfhxs_c_Log2 5.4200119
## hs_pfoa_c_Log2 9.0245854
## hs_pfos_c_Log2 6.9947002
## hs_prpa_cadj_Log2 5.0401167
## hs_mbzp_cadj_Log2 4.4169868
## hs_mibp_cadj_Log2 4.0028339
## hs_mnbp_cadj_Log2 5.0996136
## h_bfdur_Ter 2.8968599
## hs_bakery_prod_Ter 2.9013737
## hs_dairy_Ter 1.2216138
## hs_fastfood_Ter 0.9580573
## hs_org_food_Ter 1.1570809
## hs_readymade_Ter 1.4572844
## hs_total_bread_Ter 1.1592172
## hs_total_fish_Ter 1.4961459
## hs_total_fruits_Ter 1.1767595
## hs_total_lipids_Ter 1.3061318
## hs_total_potatoes_Ter 1.2547008
## hs_total_sweets_Ter 0.9540810
## hs_total_veg_Ter 1.0142937
## metab_1 4.5011841
## metab_2 4.1531041
## metab_3 3.4379265
## metab_4 5.7831834
## metab_5 2.9178829
## metab_6 9.2084145
## metab_7 4.5028527
## metab_8 36.1613265
## metab_9 2.8921966
## metab_10 3.3253911
## metab_11 4.0194203
## metab_12 3.3756205
## metab_13 4.1227190
## metab_14 5.2837840
## metab_15 4.4314031
## metab_16 2.6462109
## metab_17 2.4025892
## metab_18 3.0563512
## metab_19 2.2951339
## metab_20 3.9067077
## metab_21 2.8607845
## metab_22 2.6869075
## metab_23 2.8972758
## metab_24 3.7219939
## metab_25 3.4797174
## metab_26 7.4516177
## metab_27 3.3263961
## metab_28 3.6858414
## metab_29 3.3029111
## metab_30 18.5312981
## metab_31 3.4436727
## metab_32 3.0545040
## metab_33 5.2038360
## metab_34 2.4118531
## metab_35 7.7143364
## metab_36 3.8489845
## metab_37 3.5293186
## metab_38 2.5404708
## metab_39 2.5165610
## metab_40 5.1546331
## metab_41 3.8921235
## metab_42 6.1284140
## metab_43 3.4008043
## metab_44 3.5698540
## metab_45 3.6758543
## metab_46 5.1802108
## metab_47 6.2120208
## metab_48 11.7806465
## metab_49 33.0932505
## metab_50 9.9343588
## metab_51 6.2231737
## metab_52 3.3042007
## metab_53 5.1286048
## metab_54 4.9482602
## metab_55 7.9980726
## metab_56 4.6904911
## metab_57 4.7919533
## metab_58 3.2769554
## metab_59 5.7319035
## metab_60 3.9192335
## metab_61 3.0283668
## metab_62 4.3267745
## metab_63 4.8698592
## metab_64 3.9307501
## metab_65 3.8785658
## metab_66 2.5032525
## metab_67 2.7330496
## metab_68 3.5029673
## metab_69 2.4526201
## metab_70 3.2796735
## metab_71 4.6222410
## metab_72 3.4044271
## metab_73 3.0460977
## metab_74 2.7018733
## metab_75 3.8889307
## metab_76 2.8113720
## metab_77 4.7832198
## metab_78 4.9096271
## metab_79 4.0912870
## metab_80 3.5334220
## metab_81 3.3897748
## metab_82 4.6950398
## metab_83 3.3638699
## metab_84 3.0885189
## metab_85 4.6423208
## metab_86 3.2104952
## metab_87 3.8282082
## metab_88 2.8852207
## metab_89 2.7940235
## metab_90 2.6775937
## metab_91 3.2999703
## metab_92 3.4644294
## metab_93 2.8208312
## metab_94 8.2011811
## metab_95 51.6744901
## metab_96 8.1118854
## metab_97 3.0816526
## metab_98 3.2655000
## metab_99 4.4506345
## metab_100 3.9391564
## metab_101 3.6374127
## metab_102 5.7070910
## metab_103 4.1270357
## metab_104 5.6319496
## metab_105 3.3000728
## metab_106 3.4540736
## metab_107 3.8068032
## metab_108 4.0689264
## metab_109 5.3846332
## metab_110 7.8695016
## metab_111 2.7218197
## metab_112 2.7613608
## metab_113 5.5357192
## metab_114 4.3740721
## metab_115 4.3476546
## metab_116 6.1735729
## metab_117 7.0544259
## metab_118 2.4601605
## metab_119 5.4450427
## metab_120 7.1242577
## metab_121 3.7051871
## metab_122 6.2282665
## metab_123 3.2628782
## metab_124 3.3382452
## metab_125 3.4201616
## metab_126 2.4057902
## metab_127 5.0861475
## metab_128 6.4178268
## metab_129 4.2203870
## metab_130 3.4512521
## metab_131 3.2548319
## metab_132 3.2654363
## metab_133 2.6708217
## metab_134 3.2328840
## metab_135 5.2587159
## metab_136 5.0406508
## metab_137 6.3400908
## metab_138 4.7455979
## metab_139 3.7303384
## metab_140 2.7325221
## metab_141 7.9353350
## metab_142 13.5320442
## metab_143 9.6865623
## metab_144 3.9033383
## metab_145 4.2785181
## metab_146 4.2770667
## metab_147 3.5426166
## metab_148 3.3025573
## metab_149 5.6260285
## metab_150 4.9789354
## metab_151 3.5939464
## metab_152 4.7580724
## metab_153 4.3991465
## metab_154 5.0953881
## metab_155 2.5069093
## metab_156 2.8819060
## metab_157 4.3767383
## metab_158 3.9225685
## metab_159 3.3643529
## metab_160 7.0114089
## metab_161 28.1188870
## metab_162 3.9630717
## metab_163 14.4399347
## metab_164 7.1747718
## metab_165 3.9310471
## metab_166 3.8429719
## metab_167 3.3589423
## metab_168 3.1719266
## metab_169 4.2720474
## metab_170 4.2149550
## metab_171 4.9419993
## metab_172 3.7296383
## metab_173 4.0956131
## metab_174 4.6193564
## metab_175 3.7661891
## metab_176 6.2453467
## metab_177 14.2757880
par(mar = c(6, 14, 4, 4) + 0.1)
varImpPlot(rf_model, cex = 0.6)
rf_predictions <- predict(rf_model, newdata = test_data)
rf_mse <- mean((rf_predictions - y_test)^2)
rmse_rf <- sqrt(rf_mse)
cat("Diet + Chemical + Metabolomic + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Chemical + Metabolomic + Covariates Random Forest RMSE: 1.004679
rf_model <- train(
x_train, y_train,
method = "rf",
trControl = train_control,
tuneLength = 10
)
rf_predictions_cv <- predict(rf_model, x_test)
mse_rf_cv <- mean((y_test - rf_predictions_cv)^2)
rmse_rf_cv <- sqrt(mse_rf_cv)
cat("10-Fold CV Random Forest RMSE:", rmse_rf_cv, "\n")
## 10-Fold CV Random Forest RMSE: 1.004854
Random Forest performs better than the decision trees but worse than the penalized regression models. The consistent RMSE between the test and CV indicates stable performance. Random Forests are robust but may not capture the data’s underlying patterns as effectively as the penalized regression models in this context.
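Node-purity importance (shown above) can favor variables with many candidate split points; permutation importance (%IncMSE) is a useful cross-check and only requires `importance = TRUE` at fit time. A sketch:

```r
set.seed(101)
rf_perm <- randomForest(hs_zbmi_who ~ ., data = train_data,
                        ntree = 500, importance = TRUE)
perm_imp <- importance(rf_perm, type = 1) # permutation-based %IncMSE
head(perm_imp[order(-perm_imp[, 1]), , drop = FALSE], 10) # top 10 predictors
```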
gbm_model <- gbm(hs_zbmi_who ~ ., data = train_data,
distribution = "gaussian",
n.trees = 1000,
interaction.depth = 3,
n.minobsinnode = 10,
shrinkage = 0.01,
cv.folds = 5,
verbose = TRUE)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.4860 nan 0.0100 0.0031
## 2 1.4799 nan 0.0100 0.0042
## 3 1.4751 nan 0.0100 0.0037
## 4 1.4720 nan 0.0100 0.0007
## 5 1.4668 nan 0.0100 0.0043
## 6 1.4622 nan 0.0100 0.0017
## 7 1.4565 nan 0.0100 0.0049
## 8 1.4519 nan 0.0100 0.0038
## 9 1.4470 nan 0.0100 0.0034
## 10 1.4429 nan 0.0100 0.0026
## 20 1.3973 nan 0.0100 0.0033
## 40 1.3199 nan 0.0100 0.0026
## 60 1.2519 nan 0.0100 0.0015
## 80 1.1922 nan 0.0100 0.0020
## 100 1.1425 nan 0.0100 0.0008
## 120 1.0943 nan 0.0100 0.0013
## 140 1.0515 nan 0.0100 0.0003
## 160 1.0125 nan 0.0100 0.0003
## 180 0.9736 nan 0.0100 0.0003
## 200 0.9397 nan 0.0100 0.0008
## 220 0.9070 nan 0.0100 0.0012
## 240 0.8800 nan 0.0100 0.0009
## 260 0.8535 nan 0.0100 -0.0001
## 280 0.8275 nan 0.0100 0.0005
## 300 0.8029 nan 0.0100 0.0002
## 320 0.7809 nan 0.0100 0.0003
## 340 0.7594 nan 0.0100 0.0003
## 360 0.7400 nan 0.0100 0.0000
## 380 0.7220 nan 0.0100 0.0000
## 400 0.7042 nan 0.0100 0.0003
## 420 0.6874 nan 0.0100 0.0002
## 440 0.6711 nan 0.0100 0.0003
## 460 0.6552 nan 0.0100 0.0000
## 480 0.6392 nan 0.0100 -0.0001
## 500 0.6237 nan 0.0100 -0.0002
## 520 0.6107 nan 0.0100 0.0000
## 540 0.5974 nan 0.0100 0.0000
## 560 0.5852 nan 0.0100 -0.0002
## 580 0.5728 nan 0.0100 0.0001
## 600 0.5612 nan 0.0100 -0.0001
## 620 0.5500 nan 0.0100 -0.0004
## 640 0.5386 nan 0.0100 0.0002
## 660 0.5285 nan 0.0100 0.0000
## 680 0.5187 nan 0.0100 -0.0001
## 700 0.5086 nan 0.0100 -0.0001
## 720 0.4995 nan 0.0100 -0.0001
## 740 0.4900 nan 0.0100 0.0001
## 760 0.4813 nan 0.0100 -0.0001
## 780 0.4729 nan 0.0100 -0.0000
## 800 0.4645 nan 0.0100 -0.0001
## 820 0.4565 nan 0.0100 -0.0002
## 840 0.4491 nan 0.0100 -0.0001
## 860 0.4409 nan 0.0100 0.0000
## 880 0.4336 nan 0.0100 -0.0002
## 900 0.4267 nan 0.0100 -0.0001
## 920 0.4197 nan 0.0100 -0.0001
## 940 0.4133 nan 0.0100 -0.0001
## 960 0.4067 nan 0.0100 -0.0000
## 980 0.4007 nan 0.0100 -0.0002
## 1000 0.3943 nan 0.0100 -0.0001
summary(gbm_model)
# finding the best number of trees based on cross-validation
best_trees <- gbm.perf(gbm_model, method = "cv")
predictions_gbm <- predict(gbm_model, test_data, n.trees = best_trees)
mse_gbm <- mean((y_test - predictions_gbm)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Diet + Chemical + Metabolomic + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Chemical + Metabolomic + Covariates GBM RMSE: 0.9537
gbm_model <- train(
x_train, y_train,
method = "gbm",
trControl = train_control,
tuneLength = 10,
verbose = FALSE
)
gbm_predictions_cv <- predict(gbm_model, x_test)
mse_gbm_cv <- mean((y_test - gbm_predictions_cv)^2)
rmse_gbm_cv <- sqrt(mse_gbm_cv)
cat("10-Fold CV GBM RMSE:", rmse_gbm_cv, "\n")
## 10-Fold CV GBM RMSE: 0.9526345
GBM performs well with RMSE values close to the penalized regression models, particularly Group Elastic Net and Ridge. GBM’s iterative boosting process helps improve predictive accuracy, but it still falls slightly short of the performance of Group Lasso.
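If more control over the boosting hyperparameters is wanted than `tuneLength` provides, caret accepts an explicit grid; the values below are illustrative, not the ones used in this analysis:

```r
# candidate hyperparameter combinations for caret to cross-validate
gbm_grid <- expand.grid(
  n.trees = c(500, 1000, 1500),
  interaction.depth = c(2, 3, 4),
  shrinkage = c(0.005, 0.01),
  n.minobsinnode = c(10, 20)
)
gbm_tuned <- train(x_train, y_train, method = "gbm",
                   trControl = train_control, tuneGrid = gbm_grid,
                   verbose = FALSE)
gbm_tuned$bestTune # the CV-selected combination
```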
Model Comparison
Group Lasso has the lowest RMSE values, indicating the best predictive performance and effective variable selection. Group Ridge, Group Elastic Net, and GBM also perform well, with RMSE values close to Group Lasso. These models offer a balance between prediction accuracy and interpretability. Decision Trees and Random Forests are less effective for this particular dataset and problem.
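Rather than hand-entering the final-model RMSEs, the values computed above can be collected programmatically. A sketch using the objects from this section:

```r
# assemble the test RMSEs computed earlier into a single comparison table
rmse_summary <- data.frame(
  Model = c("Group Lasso", "Group Ridge", "Group Elastic Net",
            "Pruned Tree", "Random Forest", "GBM"),
  Test_RMSE = c(rmse_group_lasso, rmse_group_ridge, rmse_group_enet,
                rmse_pruned_tree, rmse_rf_cv, rmse_gbm_cv)
)
rmse_summary[order(rmse_summary$Test_RMSE), ] # best model first
```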
results <- data.frame(
Model = rep(c("Lasso", "Ridge", "Elastic Net", "Decision Tree", "Random Forest", "GBM"), each = 5),
Data_Set = rep(c("Covariates", "Diet + Covariates", "Chemicals + Covariates", "Diet + Chemical + Covariates", "Diet + Chemical + Metabolomic + Covariates"), 6),
RMSE = c(1.152, 1.139, 1.041, 1.131, 0.875,
1.149, 1.129, 1.035, 1.033, 0.885,
1.152, 1.139, 1.035, 1.032, 0.885,
1.155, 1.149, 1.090, 1.090, 1.128,
1.155, 1.129, 1.032, 1.026, 1.005,
1.150, 1.136, 1.040, 1.048, 0.953)
)
results$Data_Set <- factor(results$Data_Set, levels = c("Covariates", "Diet + Covariates", "Chemicals + Covariates", "Diet + Chemical + Covariates", "Diet + Chemical + Metabolomic + Covariates"))
rmse_plot <- ggplot(results, aes(x = Data_Set, y = RMSE, fill = Model)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(rmse_plot)
# filter results for Lasso model
lasso_results <- subset(results, Model == "Lasso")
rmse_lasso_plot <- ggplot(lasso_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Lasso Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(rmse_lasso_plot)
# filter results for ridge model
ridge_results <- subset(results, Model == "Ridge")
rmse_ridge_plot <- ggplot(ridge_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Ridge Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(rmse_ridge_plot)
# filter results for elastic net model
enet_results <- subset(results, Model == "Elastic Net")
rmse_enet_plot <- ggplot(enet_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Elastic Net Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(rmse_enet_plot)
# filter results for decision tree model
tree_results <- subset(results, Model == "Decision Tree")
rmse_tree_plot <- ggplot(tree_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Decision Tree Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(rmse_tree_plot)
# filter results for random forest model
rf_results <- subset(results, Model == "Random Forest")
rmse_rf_plot <- ggplot(rf_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Random Forest Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(rmse_rf_plot)
# filter results for GBM model
gbm_results <- subset(results, Model == "GBM")
rmse_gbm_plot <- ggplot(gbm_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "GBM Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(rmse_gbm_plot)
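The six per-model blocks above repeat the same plot for different subsets of `results`; a single faceted chart can produce all of them in one pass. A minimal sketch, using a placeholder stand-in for the `results` frame built above (the RMSE values here are illustrative, not the study's results):

```r
library(ggplot2)

# Stand-in for the `results` data frame built above (same columns);
# RMSE values are placeholders for illustration only.
results <- expand.grid(
  Model = c("Lasso", "Ridge", "Elastic Net", "Decision Tree",
            "Random Forest", "GBM"),
  Data_Set = c("Covariates", "Diet + Covariates", "Chemicals + Covariates",
               "Diet + Chemical + Covariates",
               "Diet + Chemical + Metabolomic + Covariates")
)
set.seed(1)
results$RMSE <- runif(nrow(results), 0.85, 1.16)  # placeholder values

# One faceted plot replaces the six per-model bar charts:
rmse_facet_plot <- ggplot(results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_col() +                       # geom_col() == geom_bar(stat = "identity")
  facet_wrap(~ Model) +
  labs(title = "Per-Model RMSE Comparison",
       x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        legend.position = "none")
print(rmse_facet_plot)
```

With the actual `results` frame in scope, only the `ggplot()` call onward is needed; faceting also guarantees a shared y-axis scale across models, which the separate per-model plots do not.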
Group Lasso, Ridge, and Elastic Net models:
Performed consistently well across variable sets, with the lowest RMSEs observed in the full model combining diet, chemical exposures, metabolomics, and covariates.
Adding metabolomic data markedly improved performance, suggesting strong predictive power for BMI.
Decision trees:
Showed moderate performance; pruning improved RMSE, but even pruned trees lagged behind the regularized regression models.
Random Forest and GBM:
Demonstrated robust performance, with GBM slightly outperforming Random Forest.
Including diverse data types (diet, chemical, metabolomic) improved RMSE, underscoring the benefit of a comprehensive data approach.
Overall, combining demographic, dietary, chemical, and metabolomic data yields the most accurate BMI predictions, with regularized regression techniques (Lasso, Ridge, Elastic Net) and ensemble methods (Random Forest, GBM) performing best.
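As a concrete illustration of how a 10-fold cross-validated RMSE like those reported above can be obtained for a lasso model, a minimal sketch using cv.glmnet; the matrix X and outcome y here are simulated stand-ins, not the HELIX diet/chemical/metabolomic features or BMI Z-scores:

```r
library(glmnet)

set.seed(1)
# Simulated stand-ins for the predictor matrix and BMI Z-score outcome:
X <- matrix(rnorm(500 * 20), nrow = 500, ncol = 20)
y <- 0.5 * X[, 1] - 0.3 * X[, 2] + rnorm(500)

# 10-fold cross-validated lasso (alpha = 1); cv.glmnet reports MSE by default,
# so the square root of the minimum cross-validated error gives the RMSE
cv_fit <- cv.glmnet(X, y, alpha = 1, nfolds = 10, type.measure = "mse")
rmse <- sqrt(min(cv_fit$cvm))  # RMSE at the best lambda
print(rmse)
```

The same pattern covers ridge (alpha = 0) and elastic net (0 < alpha < 1) by changing a single argument, which is one reason the three regularized models are convenient to compare on identical folds.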